
 
countblessings
Topic Author
Posts: 0
Joined: June 7th, 2012, 11:05 am

Stuck in PCA .......!

July 2nd, 2013, 2:43 am

Hi All, I am a student struggling to understand how PCA helps find the impact of, say, 1000 exogenous features (x1, x2, ..., x1000) on an endogenous variable (y), beyond the fact that it definitely reduces the dimensionality of the problem and lets us view the overall structure in 2D or 3D space.

What's not clear to me is that the so-called "impact" we are visualizing (of the handful of principal components that explain most of the variance) is not the impact of the actual variables x1, x2, ..., x1000, but of the principal components! Each of the new features, say z1, z2, ..., zk (where k << 1000), is a complete transformation of the original features: each principal component is the dot product of the corresponding eigenvector with the vector of the original features. So, examining only the top k PCs, how do I reverse engineer which of the original features are the impactful ones?

I have googled this, and there are a variety of answers, but none that makes sense to me (maybe I am missing something fundamental); many students have the same predicament. Can any experts on this forum share their insight, please? Most grateful for your responses.
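For concreteness, here is a minimal NumPy sketch of the transformation I mean (the data shape and k below are made-up placeholders, not my actual 1000-feature problem):

[code]
# Minimal PCA by hand: the PC scores are dot products of eigenvectors
# with the original feature vectors. Data and k are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 10))      # 500 observations of 10 features
X = X - X.mean(axis=0)                  # centre each column before PCA

cov = np.cov(X, rowvar=False)           # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]       # re-sort by explained variance, descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 3
Z = X @ eigvecs[:, :k]                  # column j of Z holds z_j = X w_j: each PC score
                                        # is the dot product of eigenvector w_j with the
                                        # original feature vector of each observation
[/code]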
 
Traden4Alpha
Posts: 3300
Joined: September 20th, 2002, 8:30 pm

Stuck in PCA .......!

July 2nd, 2013, 11:40 am

PCA was never intended as a way to decide which subset of the original features has the most impact. Instead, it synthesizes a new and smaller set of features (orthogonal linear combinations of the original features) that explains the greatest amount of the variance. If the exogenous features were actually independent of each other, then PCA would give you the impact of each feature. But if the exogenous features covary with each other, then PCA does not give the impact of the original features (and shouldn't, because covariation of the exogenous features suggests that the system has a smaller set of underlying or hidden features that explains the variation of the endogenous variable).

If your 1000 exogenous features are NOT clean, independent variables, why do you think you can pull out a useful subset of them?

But if you insist on relating the PCA results back to the original features, then you could look at the sum of the squares of the eigenvector weights of each original feature (summed across the top PCA components) to get a measure of how much each original feature contributes to those components (see the sketch below). But be warned that any subset of original features selected by ranking this sum of squared weights will only explain a fraction of a fraction of the variance.
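To make that last suggestion concrete, a short scikit-learn sketch (X and k below are placeholders; substitute your own feature matrix):

[code]
# Sum-of-squares-of-the-weights heuristic: rank original features by their
# total squared eigenvector weight across the top k components.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 50))            # stand-in (n_samples, n_features) data

k = 5                                         # number of top components kept
pca = PCA(n_components=k).fit(X)              # PCA centres the data internally

# pca.components_ has shape (k, n_features); row j is the j-th eigenvector.
contrib = (pca.components_ ** 2).sum(axis=0)  # per-feature sum of squared weights
ranking = np.argsort(contrib)[::-1]           # features ranked by contribution
print(ranking[:10])                           # indices of the "most contributing" features
[/code]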
 
countblessings
Topic Author
Posts: 0
Joined: June 7th, 2012, 11:05 am

Stuck in PCA .......!

July 2nd, 2013, 2:26 pm

Traden4Alpha - Pretty impressed with the clarity and comprehensiveness of your response - thank you! Would it be meaningful, then, to perform a linear (or logistic) regression, or to fit any other kind of model (say, a neural network), on the principal components vs. the dependent variable, instead of on the 1000 exogenous features vs. the dependent variable? What I am trying to get at is: I appreciate how one reduces dimensionality using PCA, and am now trying to understand where one goes from there. What do we do with the principal components, besides knowing that they capture the greatest amount of variance?
 
Traden4Alpha
Posts: 3300
Joined: September 20th, 2002, 8:30 pm

Stuck in PCA .......!

July 2nd, 2013, 6:48 pm

It all depends on what you are trying to do.

1. If you want to find the smallest number of modes of variation that explains the greatest amount of the variance, then PCA is the way to go. But it won't give you that answer in terms of the original features if those features are not nicely orthogonal to each other.

2. If you want to get rid of spurious variables (i.e., not collect all 1000 features in the future), then I think you can construct hypothesis tests on the mode weights and eigenvector weights to decide which of the 1000 features are not statistically different from zero in explaining the covariance in the system.

3. If you just want some sense of which of the original features matter, then regression on the principal components, or looking at the sum of squared PCA weights, would offer some insight (a sketch of the regression route follows below). But beware that regression on the raw features will give unstable estimates in the presence of multicollinearity (which is what you have if the PCA mode vectors are not clean). And beware that the sum of squared PCA weights will give you a subset of features that explains only a fraction of the variance explained by the retained modes, which is itself a fraction of the total variance.
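For option 3, a quick sketch of regressing y on the PC scores (principal component regression); the data, k, and variable names below are synthetic placeholders:

[code]
# Principal component regression: standardize, project onto the top-k PCs,
# then regress y on the PC scores instead of on the raw features.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.standard_normal((500, 50))                     # stand-in exogenous features
y = X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.standard_normal(500)

k = 5
pcr = make_pipeline(StandardScaler(), PCA(n_components=k), LinearRegression())
pcr.fit(X, y)
print("in-sample R^2:", pcr.score(X, y))

# The k PC-space coefficients can be mapped back to (standardized) feature
# space, one way to "reverse engineer" implied weights on original features:
pca_step = pcr.named_steps["pca"]
reg = pcr.named_steps["linearregression"]
beta_features = pca_step.components_.T @ reg.coef_    # shape (n_features,)
[/code]

Note that beta_features are implied weights on the standardized features, and they only recover the part of the impact that lives in the retained components, so the caveats above still apply.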
 
countblessings
Topic Author
Posts: 0
Joined: June 7th, 2012, 11:05 am

Stuck in PCA .......!

July 2nd, 2013, 8:31 pm

Thank you for the detailed response. Highly insightful.