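Jiangsu Normal University, Priority Academic Program Development of Jiangsu Higher Education Institutions: Frontiers in Probability and Statistics Lecture Series, No. 35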

Posted: 2014-10-27

 

Professor Kai Wang Ng (吴启宏)

Title: THE NATURE OF SPURIOUS CORRELATIONS

Time: Wednesday, October 29, 2014, 2:30-3:30 p.m.

Venue: Academic Lecture Hall, School of Mathematics and Statistics, Jiangsu Normal University (Room 1506, Jingyuan Building)

Abstract: In a paper read before the Royal Statistical Society, which drew comments from 41 discussants, Fan and Lv (2008) considered the problem of screening a set of random predictors (X_1, X_2, ..., X_p) for a response Y, based on a realistically small sample size n. To illustrate in the Introduction that “When p is large, some of the intuition might not be accurate”, the authors reported the so-called “spurious correlations” or “noise accumulation” from simulations. The phenomena, without theoretical explanation, were again reported in the Journal of Machine Learning Research by Fan, Samworth and Wu (2009, p.2014), in an Invited Review Article in Statistica Sinica by Fan and Lv (2010, p.102), in Annu. Rev. Econ. by Fan, Lv and Qi (2011, p.239, p.294), and in J. R. Statist. Soc., B by Fan, Guo and Hao (2012, p.39). They concern the seemingly excessive values of the largest absolute correlation coefficient in their simulations, even though the p + 1 variables are known to be independent standard normal. Such intuitively surprising phenomena have important consequences, particularly in biostatistics and genetics, where the variables (features) under study usually number in the thousands or tens of thousands while the sample size is in the tens or hundreds. The topic and its discussions have been cited in nearly a thousand publications according to Google Scholar. This seminar shows the theoretical genesis of such “spurious correlations” in a framework more general than normality, where the p independent samples are such that each of the relocated vectors has a spherically symmetric density function of dimension n, from a possibly different family with a possibly different scaling parameter, while the independent vector y has an arbitrary density function of dimension n. Under these assumptions, the squared largest sample correlation coefficient is distributed as the maximum of p random variables that are independent and identically distributed as Beta(1/2, (n − 2)/2). After controlling for q of the X-variables, where q < n − 2, the result for the remaining p − q sample partial correlation coefficients is the same as in the simple correlation case but with the sample size n reduced to n − q. The proof assuming normality can be incorporated into an undergraduate curriculum without difficulty. Some direct applications and implications are provided.
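The normal-theory case of the abstract can be checked numerically. The sketch below is only an illustration, not the speaker's method: it assumes independent standard normal data, relies on the classical fact that a squared sample correlation between independent normal samples of size n follows Beta(1/2, (n − 2)/2), and uses hypothetical function names and arbitrary settings (n = 20, p = 200, 200 replications). It compares the median of the largest absolute correlation from direct simulation with the median obtained from the maximum of p Beta draws.

```python
import math
import random
import statistics


def max_abs_corr(n, p, rng):
    """Largest |sample correlation| between y and p independent N(0,1) predictors."""
    y = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y_mean = sum(y) / n
    yc = [v - y_mean for v in y]
    y_norm = math.sqrt(sum(v * v for v in yc))
    best = 0.0
    for _ in range(p):
        # Each predictor is drawn independently of y, so any correlation is spurious.
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        x_mean = sum(x) / n
        xc = [v - x_mean for v in x]
        x_norm = math.sqrt(sum(v * v for v in xc))
        r = sum(a * b for a, b in zip(xc, yc)) / (x_norm * y_norm)
        best = max(best, abs(r))
    return best


def max_abs_corr_beta(n, p, rng):
    """Square root of the maximum of p i.i.d. Beta(1/2, (n - 2)/2) draws."""
    return math.sqrt(max(rng.betavariate(0.5, (n - 2) / 2) for _ in range(p)))


def compare(n=20, p=200, reps=200, seed=0):
    """Return the median max |r| from direct simulation and from the Beta route."""
    rng = random.Random(seed)
    direct = [max_abs_corr(n, p, rng) for _ in range(reps)]
    beta = [max_abs_corr_beta(n, p, rng) for _ in range(reps)]
    return statistics.median(direct), statistics.median(beta)


if __name__ == "__main__":
    d, b = compare()
    print(f"median max |r|, direct simulation:   {d:.3f}")
    print(f"median max |r|, Beta representation: {b:.3f}")
```

With these settings the two medians should agree up to simulation error, and both should be large even though every predictor is independent of the response. Increasing p or decreasing n in compare() pushes the largest absolute correlation higher still.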

About Professor Kai Wang Ng:

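Professor Kai Wang Ng (吴启宏) is a former Head and an Honorary Professor of the Department of Statistics and Actuarial Science, The University of Hong Kong. He is a member of the American Statistical Association, a Fellow of the Royal Statistical Society, a life member of the International Chinese Statistical Association, and a life member of the Hong Kong Statistical Society. He currently serves as an associate editor of journals including “Insurance: Mathematics and Economics” and “Case Studies in Business, Industry and Government Statistics”. His research areas include foundations of inference, the converse of Bayes' theorem and its applications, distribution theory, actuarial and financial risk, applications of asymptotic theory, multivariate analysis, linear models, and data mining and informatics.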