
Academic Seminars

Department of Business Statistics and Econometrics Seminar Series (2014-18)

Published: 2014-12-15

Statistics Seminar 2014-18

Topic: Phase Transitions for Post-selection PCA

Speaker: Zheng Tracy Ke, University of Chicago

Time: Thursday, 18 December, 14:00-15:00

Location: Room 114, Guanghua Building 1

Abstract: Consider a two-class clustering setting where we observe an n by p data matrix X = LU' + Z. Here, L in R^n is the vector of class labels, U in R^p is the contrast mean vector between the two classes, and Z is a matrix of standard Gaussian noise. Both L and U are unknown, but U is presumed to be sparse, and the goal is to use X to estimate L (i.e., clustering).

In such settings, an appealing approach is the following two-stage spectral clustering algorithm. In the first stage, for a threshold t to be determined, we select only the small fraction of features whose chi-squared test scores are larger than t. In the second stage, we cluster by applying classical PCA to the post-selection data matrix.
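The two stages can be illustrated with a minimal sketch on simulated data. Several details here are assumptions made for illustration and are not taken from the talk: labels are coded as -1/+1, the chi-squared score of a feature is its column sum of squares centered and scaled by the null mean n and standard deviation sqrt(2n), and the names `simulate` and `post_selection_pca` are hypothetical.

```python
import numpy as np


def simulate(n, p, n_signal, mu, rng):
    """Draw X = L U' + Z with labels in {-1, +1} and a sparse U (illustrative setup)."""
    labels = rng.choice([-1.0, 1.0], size=n)
    contrast = np.zeros(p)
    contrast[:n_signal] = mu  # only the first n_signal features carry signal
    noise = rng.standard_normal((n, p))
    return np.outer(labels, contrast) + noise, labels


def post_selection_pca(X, t):
    """Stage 1: keep features whose chi-squared score exceeds t. Stage 2: PCA."""
    n = X.shape[0]
    # Under pure noise a column sum of squares is chi-squared with n degrees
    # of freedom (mean n, variance 2n); this normalization is an assumption.
    scores = (np.sum(X**2, axis=0) - n) / np.sqrt(2.0 * n)
    selected = scores > t
    if not selected.any():
        raise ValueError("no feature passes the threshold; lower t")
    # Leading left singular vector of the post-selection data matrix.
    left, _, _ = np.linalg.svd(X[:, selected], full_matrices=False)
    estimated = np.where(left[:, 0] >= 0, 1.0, -1.0)
    return estimated, selected


rng = np.random.default_rng(0)
X, labels = simulate(n=100, p=1000, n_signal=30, mu=1.0, rng=rng)
estimated, selected = post_selection_pca(X, t=4.0)
# Cluster labels are identifiable only up to a global sign flip.
accuracy = max(np.mean(estimated == labels), np.mean(estimated == -labels))
print(f"kept {selected.sum()} features, accuracy {accuracy:.2f}")
```

The threshold t=4.0 in the usage lines is picked by hand for this toy example; choosing it from the data is the problem the abstract turns to next.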

A challenging problem is how to set the threshold t. We propose a new approach to choosing the threshold by adapting the recent notion of Higher Criticism Thresholding (HCT). Combining HCT with post-selection PCA gives an easy-to-implement, tuning-free spectral clustering approach.
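As a rough illustration of how a Higher Criticism style threshold can be chosen from the data, the sketch below maximizes a commonly used form of the HC objective over sorted feature p-values. The exact functional adapted in the talk may differ; the function name `hct_cutoff`, the search range `alpha0`, and the choice of objective are assumptions.

```python
import numpy as np


def hct_cutoff(pvalues, alpha0=0.5):
    """Return (k_hat, p_cut): how many features to keep and the p-value cutoff.

    Assumed objective: HC(k) = sqrt(p) * (k/p - pi_(k)) / sqrt((k/p) * (1 - k/p)),
    maximized over 1 <= k <= alpha0 * p, where pi_(k) is the k-th smallest p-value.
    """
    pv = np.sort(np.asarray(pvalues, dtype=float))
    p = pv.size
    if p < 2 or not 0.0 < alpha0 < 1.0:
        raise ValueError("need at least two p-values and 0 < alpha0 < 1")
    k_max = max(1, int(alpha0 * p))  # always below p, so 1 - k/p stays positive
    k = np.arange(1, k_max + 1)
    frac = k / p
    hc = np.sqrt(p) * (frac - pv[:k_max]) / np.sqrt(frac * (1.0 - frac))
    k_hat = int(np.argmax(hc)) + 1
    return k_hat, pv[k_hat - 1]


# Toy input: 50 very small p-values among 950 unremarkable ones.
pvals = np.concatenate([np.full(50, 1e-8), np.linspace(0.2, 1.0, 950)])
k_hat, p_cut = hct_cutoff(pvals)
print(k_hat, p_cut)
```

Features whose p-value is at most `p_cut` are kept, so the threshold t is the test score that corresponds to `p_cut`. No tuning parameter has to be supplied by hand apart from the search range.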

We reveal an interesting phase transition phenomenon: if we calibrate the contrast mean vector by the sparsity level and the signal strengths, then there is a Region of Impossibility, where the signals are so rare and weak that post-selection PCA is bound to fail (no matter how we set the threshold t), and a Region of Possibility, where the signals are strong enough that post-selection PCA succeeds, provided that t is properly set.

We also reveal an interesting phenomenon associated with post-selection PCA. In the Region of Possibility, there are choices of the threshold t such that the leading left singular vector xi^(t) is approximately distributed as N(A(t)L, I_n), where A(t) is a non-stochastic function of t which we call the post-selection Signal-to-Noise Ratio (SNR).

We outline an intimate relationship between HCT and the post-selection SNR, and show that in many rare and weak settings HCT provides a consistent estimate of the ideal threshold, that is, the threshold that maximizes the post-selection SNR.

The work is closely related to the recent idea of Important Features PCA (IF-PCA), although the focus there is on real data applications. The work is also closely related to the recent interest in sparse PCA.

