Statistics Seminar(2014-18)
Topic:Phase Transitions for Post-selection PCA
Speaker:Zheng Tracy Ke, University of Chicago
Time:Thursday, 18 December, 14:00-15:00
Location:Room 114, Guanghua Building 1
Abstract:Consider a two-class clustering setting where we observe an n by p data matrix X = LU' + Z. Here, L in R^n is the vector of class labels, U in R^p is the contrast mean vector between the two classes, and Z is a matrix of standard Gaussian noise. Both L and U are unknown, but U is presumed to be sparse, and the goal is to use X to estimate L (i.e., clustering).
In such settings, an appealing approach is the following two-stage spectral clustering algorithm. In the first stage, for a threshold t to be determined, we select only the small fraction of features whose chi-squared test scores are larger than t. In the second stage, we cluster by applying classical PCA to the post-selection data matrix.
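The two stages above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the speaker's implementation: the function name `post_selection_pca` is hypothetical, the class labels are assumed to be coded as +1/-1, the chi-squared score of a feature is taken to be its standardized column sum of squares, and the clusters are read off from the signs of the leading left singular vector.

```python
import numpy as np

def post_selection_pca(X, t):
    """Sketch of two-stage clustering: chi-squared screening, then PCA.

    X is an n-by-p data matrix and t is the screening threshold.
    Returns estimated labels in {-1, +1} and a boolean mask of kept features.
    """
    n, p = X.shape
    # Stage 1: under pure standard Gaussian noise, a column's sum of squares
    # is chi-squared with n degrees of freedom (mean n, variance 2n).
    # Standardizing is an assumed normalization, chosen for illustration.
    scores = (np.sum(X**2, axis=0) - n) / np.sqrt(2 * n)
    keep = scores > t
    if not np.any(keep):
        raise ValueError("threshold t removed every feature")
    # Stage 2: classical PCA on the post-selection matrix, via the SVD.
    left_vectors, _, _ = np.linalg.svd(X[:, keep], full_matrices=False)
    xi = left_vectors[:, 0]  # leading left singular vector
    # The sign of a singular vector is arbitrary, so the labels are
    # identified only up to a global flip.
    labels = np.where(xi >= 0, 1, -1)
    return labels, keep
```

Because the sign of a singular vector is not identified, any accuracy check against the true labels should compare both `labels` and `-labels`.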
A challenging problem is how to set the threshold t. We propose a new approach to choosing the threshold by adapting the recent notion of Higher Criticism Thresholding (HCT). Combining HCT with post-selection PCA gives an easy-to-implement, tuning-free spectral clustering approach.
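A data-driven threshold of this kind can be sketched as follows. The abstract does not spell out the Higher Criticism functional that is used, so this sketch assumes one commonly cited Donoho-Jin form, HC(k) = sqrt(p) (k/p - pi_(k)) / sqrt((k/p)(1 - k/p)), where pi_(k) is the k-th smallest feature p-value. The function name `hc_threshold`, the search fraction `alpha0`, and the use of raw column sums of squares as chi-squared statistics are all assumptions made for illustration.

```python
import numpy as np
from scipy.stats import chi2

def hc_threshold(X, alpha0=0.5):
    """Sketch of Higher Criticism thresholding for chi-squared feature scores.

    Returns the estimated threshold (on the raw sum-of-squares scale) and
    the sorted indices of the selected features. Assumes p >= 2.
    """
    n, p = X.shape
    sumsq = np.sum(X**2, axis=0)
    # p-value of each feature under the null of pure standard Gaussian noise
    pvals = chi2.sf(sumsq, df=n)
    order = np.argsort(pvals)  # most significant features first
    pi_sorted = pvals[order]
    k = np.arange(1, p + 1)
    frac = k / p
    # Search only the first alpha0 * p order statistics (and never k = p,
    # where the denominator below would vanish).
    k_max = min(p - 1, max(1, int(alpha0 * p)))
    hc = (np.sqrt(p) * (frac[:k_max] - pi_sorted[:k_max])
          / np.sqrt(frac[:k_max] * (1 - frac[:k_max])))
    k_hat = int(np.argmax(hc)) + 1  # number of features to keep
    selected = order[:k_hat]
    # Keeping features with sumsq >= t_hat reproduces `selected`
    # (ties aside).
    t_hat = sumsq[selected].min()
    return t_hat, np.sort(selected)
```

The appeal of this construction is that the threshold is read off from the sorted p-values themselves, so no tuning parameter has to be chosen beyond the search range `alpha0`.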
We reveal an interesting phase transition phenomenon: if we calibrate the contrast mean vector by the sparsity level and the signal strength, then there is a Region of Impossibility, where the signals are so rare and weak that post-selection PCA is bound to fail (no matter how we set the threshold t), and a Region of Possibility, where the signals are strong enough that post-selection PCA succeeds, provided that t is properly set.
We also reveal an interesting phenomenon associated with post-selection PCA. In the Region of Possibility, there are choices of the threshold t such that the leading left singular vector xi^(t) is approximately distributed as N(A(t)L, I_n), where A(t) is a non-stochastic function of t which we call the post-selection Signal-to-Noise Ratio (SNR).
We outline an intimate relationship between HCT and the post-selection SNR, and show that in many rare and weak settings HCT provides a consistent estimate of the ideal threshold, that is, the threshold that maximizes the post-selection SNR.
The work is closely related to the recent idea of Important Features PCA (IF-PCA), although the focus there is on real data applications. The work is also closely related to the recent interest in sparse PCA.