PCA: When and why to center data before computing PCs?
Posted: March 29th, 2013, 1:28 pm
by giobilkis
Hey all,

Why is it important to center data, i.e. subtract the mean from each data vector, prior to computing the principal components matrix?

More important question: when is it imperative to center data and when is it not?

Thanks,
G
Posted: March 29th, 2013, 2:56 pm
by tags
Are you talking about (why and when) normalizing the eigenvectors?
Posted: March 29th, 2013, 5:57 pm
by giobilkis
No. The eigenvectors are already normalized in both cases. My question is about centering the original data matrix.

1. In my matrix the data vectors are columns and the rows are points in time. From each data vector (a column) I subtract the mean of the column, thus creating new column vectors which have zero mean.
2. I calculate the covariance matrix. The matrix is the same for both centered and original data.
3. I calculate the eigenvalues and eigenvectors of the covariance matrix. The eigenvectors are normalized, i.e. of length 1.
4. At this stage I compute the principal components matrix. To do so, I multiply my data by the eigenvector matrix. I have a choice: I multiply either the original data or the centered data.
5. If I used centered data, then when I reconstruct the original data, using for instance the first PC, I have to add back the mean. If I used non-centered data, it is straightforward: just multiplication by the appropriate transposed eigenvector.

The MATLAB help as well as Wikipedia strongly advise centering the data, calculating the PCs and then, when reconstructing with the PCs, adding back the mean. The question is why and when this is imperative, as compared to the straightforward use of the original data.
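A small sketch of why step 2 gives the same matrix either way: a sample covariance routine subtracts the column means internally, so centering beforehand changes nothing. The uncentered product data'*data/(n-1) is a different matrix. The data values and helper names below are made up for illustration, and plain Python is used only to keep the example dependency-free.

```python
# Two columns (variables), five rows (points in time). Values are made up.
data = [
    [10.0, 100.0],
    [11.0, 102.0],
    [12.0, 101.0],
    [13.0, 105.0],
    [14.0, 107.0],
]


def column_means(x):
    """Mean of each column of a row-major matrix."""
    n = len(x)
    return [sum(row[j] for row in x) / n for j in range(len(x[0]))]


def center(x):
    """Subtract the column mean from every entry of each column."""
    m = column_means(x)
    return [[v - mj for v, mj in zip(row, m)] for row in x]


def cross_product(x):
    """x' * x / (n - 1) with NO centering (a scaled raw second moment matrix)."""
    n, p = len(x), len(x[0])
    return [[sum(row[i] * row[j] for row in x) / (n - 1) for j in range(p)]
            for i in range(p)]


def covariance(x):
    """Sample covariance: centers internally, as a typical cov routine does."""
    return cross_product(center(x))


cov_raw = covariance(data)               # covariance of the original data
cov_centered = covariance(center(data))  # covariance of pre-centered data
second_moment = cross_product(data)      # no centering anywhere

print("cov(original):  ", cov_raw)
print("cov(centered):  ", cov_centered)
print("data'*data/(n-1):", second_moment)
```

The first two matrices agree, so their eigenvectors agree too. The third one differs, and it is the matrix a PCA without any centering would be based on.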
Posted: March 29th, 2013, 7:37 pm
by tags
Hi Giobilkis.

Sorry for not getting your point before. Most of the tutorials on PCA I have come across tell you to subtract the mean without really explaining why. E.g.:

Principal Components Analysis Using R - P1 (2'50")
A tutorial on Principal Components Analysis (p.12)

But there is a nice reply at mathworks:

Quote:

Well, in theory, you don't absolutely need to center the data. In theory, you should also always center the data. Do these two statements seem to conflict? (Yes.) I'll explain below.

First of all, how do you center the data? By subtracting off the mean. Assume that your data is stored in a standard form, an nxp array, where p is the number of dimensions in your data set, and n is the number of data points in the array. Subtract off the column means from each column. Essentially we are making each variable have a mean of zero. We can also think of this as a transformation of the space to have a new origin. This will do the centering...

    data_cent = bsxfun(@minus, data, mean(data,1));

The covariance matrix is given by

    covmatrix = (data_cent'*data_cent)/(n-1);

or, we could have done it as just

    covmatrix = cov(data);

The cov function does the centering for us, but it is valuable to understand what we are doing. We can apply a PCA tool to the centered data. This is as simple as computing the eigenvectors and eigenvalues of the covariance matrix, alternatively, compute the right singular vectors and the square of the singular values of the centered data array, using svd.

Suppose we did not do the centering? What would happen? Well, as it turns out, the world will not end. We will indeed get a result, but it will not be as useful. A PCA applied on what is known as the second moment matrix, i.e., data'*data, will have the mean of the data built into the eigenvectors. (Surprisingly, this is occasionally the correct thing to do!) Mainly, the first eigenvector will look very strange. In fact, the first eigenvector will probably look very much like the mean vector itself. The rest of the vectors will be very different too.

Of course, if you use princomp from the stats toolbox, it appears to automatically center the data for you. No fun at all.

There is the final distinction of whether you do the PCA on a scaled data matrix. If the columns are scaled to have unit standard deviations, then a PCA will be based on the correlation matrix. In effect, this gives each variable the same weight as all others. That can be important if the variables in your data set have very different units and relative scalings.

HTH,
John

I think this is the clarification you are after.
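The remark that the first eigenvector of data'*data "will probably look very much like the mean vector" can be checked on a toy example. The sketch below uses made-up data whose mean is far from the origin and whose variation runs roughly perpendicular to the mean. The power-iteration helper is an illustrative stand-in for a proper eigen solver, not what MATLAB uses.

```python
import math

# Made-up data: mean near (10, 10), variation roughly along (1, -1).
data = [[9.0, 11.5], [10.0, 10.0], [11.0, 9.0], [8.0, 12.0], [12.0, 7.5]]
n, p = len(data), len(data[0])

mean = [sum(row[j] for row in data) / n for j in range(p)]
centered = [[v - m for v, m in zip(row, mean)] for row in data]


def gram(x, divisor):
    """x' * x / divisor for a row-major matrix x."""
    cols = len(x[0])
    return [[sum(r[i] * r[j] for r in x) / divisor for j in range(cols)]
            for i in range(cols)]


def leading_eigenvector(a, iterations=200):
    """Dominant eigenvector of a symmetric matrix by power iteration."""
    v = [1.0] + [0.0] * (len(a) - 1)
    for _ in range(iterations):
        w = [sum(aij * vj for aij, vj in zip(row, v)) for row in a]
        norm = math.sqrt(sum(wi * wi for wi in w))
        v = [wi / norm for wi in w]
    return v


def abs_cosine(u, v):
    """Absolute cosine of the angle between two vectors (sign-insensitive)."""
    dot = sum(a * b for a, b in zip(u, v))
    return abs(dot) / math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))


second_moment = gram(data, n - 1)   # no centering
covariance = gram(centered, n - 1)  # centered

cos_second_moment = abs_cosine(leading_eigenvector(second_moment), mean)
cos_covariance = abs_cosine(leading_eigenvector(covariance), mean)

print("first eigenvector of data'*data vs mean:", cos_second_moment)
print("first eigenvector of cov(data) vs mean: ", cos_covariance)
```

On this data the uncentered first eigenvector points almost exactly along the mean, while the first eigenvector of the covariance matrix is nearly perpendicular to it. How strong the effect is on real data depends on how large the mean is relative to the spread.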
Posted: March 29th, 2013, 8:55 pm
by giobilkis
Thanks.

I've read this thing from Matlab (that is exactly what I got in the Matlab help search) and the pdf you attached. I might be completely off, but that is exactly what I don't understand.

To be more precise, I am not sure I understand the following part of the pasted explanation:

"A PCA applied on what is known as the second moment matrix, i.e., data'*data, will have the mean of the data built into the eigenvectors."

First, it is a bit embarrassing but I have to ask: what is the second moment matrix? Isn't that a covariance matrix? If so, what does "the mean of the data built into the eigenvectors" mean?

Unless I am doing something completely wrong, the covariance matrix is exactly the same for both centered and non-centered data. Which means that the eigenvectors and eigenvalues are exactly the same for both centered and non-centered data. The difference is only in which data matrix I use later on to calculate the principal components.

As to the Matlab function, it does indeed do the centering automatically, and this forces me to add the mean back when I reconstruct the original data using only the first principal component. According to all the sources it is the right way to go. I just cannot rationalize it either to myself or to other people.

G
Posted: March 30th, 2013, 1:14 pm
by acastaldo
What do we mean by "moment"?
http://mathworld.wolfram.com/Moment.html
The second moment (or more precisely the raw second moment) is E[x^2].
The central second moment is E[(x-mu)^2], i.e. the variance; for a vector x it is the covariance matrix.
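For a vector of variables the two moments differ by exactly the outer product of the mean, which is one way to read "the mean built into the eigenvectors". A short derivation, with notation introduced here as an assumption: X is the n x p data matrix, \bar{x} the column-mean vector, \mathbf{1} the n-vector of ones and X_c = X - \mathbf{1}\bar{x}^{\top} the centered data.

```latex
\begin{align*}
\mathrm{E}\!\left[x x^{\top}\right]
  &= \mathrm{E}\!\left[(x-\mu)(x-\mu)^{\top}\right] + \mu\mu^{\top}
   = \operatorname{Cov}(x) + \mu\mu^{\top},
   \qquad \mu = \mathrm{E}[x], \\[4pt]
\frac{1}{n} X^{\top} X
  &= \frac{1}{n}\left(X_c + \mathbf{1}\bar{x}^{\top}\right)^{\top}
                 \left(X_c + \mathbf{1}\bar{x}^{\top}\right)
   = \frac{1}{n} X_c^{\top} X_c + \bar{x}\bar{x}^{\top},
   \qquad \text{since } X_c^{\top}\mathbf{1} = 0 .
\end{align*}
```

When the mean is zero the two matrices coincide. When the mean is large relative to the spread, the rank-one term \bar{x}\bar{x}^{\top} dominates the raw second moment matrix.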
Posted: March 30th, 2013, 2:38 pm
by giobilkis
Then I really fail to understand the relevance of this sentence in the Matlab help forum quote:

"A PCA applied on what is known as the second moment matrix, i.e., data'*data, will have the mean of the data built into the eigenvectors."

I apply PCA to the original data (interest rates, equity returns, etc.), not to the second moment or anything near that.

And my question is about the centering of the original data. In both cases I get the very same covariance matrix. Same eigenvalues. Same eigenvectors. The PCs will differ depending on whether the data is centered or not.
Posted: April 10th, 2014, 11:37 am
by calvinkit
I've been looking for an answer to this for a while (including reading the answer from the Matlab help message suggested above). After reading the posts above, I've decided to give my opinion (even though this is an old thread).

Say D is my original data. M is the mean of D, with every entry in a column of M being the same number (i.e. the mean of the corresponding column in D). As pointed out by the original poster, the covariance (C), and hence the eigenvector matrix (P), is the same whether or not you center the data. Below, D' and M' denote reconstructions, not transposes.

If we apply PCA to the original data, using all of P, it will be:
D' = D*P*transpose(P), and we know D = D'

If we apply PCA to the zero-mean data, using all of P, it will be:
(D-M)*P*transpose(P) = D*P*transpose(P) - M*P*transpose(P) = D-M (obviously)

But if we only use the first L columns of P, i.e. PL, then M (as well as D) is not going to be recovered exactly. The reconstructed M' should closely resemble the original M if PL captures most of the variance, but in general D != D' and M != M'.

Since they are not the same, the question is: should I apply PCA to zero-mean data or not? The answer is yes, especially if you intend (as you mostly would) to analyse the residual variance or do a z-score type of analysis. The residual (R) is defined as D-D' when PCA is applied to the original data. When M is large, the residual on M will distort/overwhelm the residual on D, so the z-score or overall residual you calculate does not represent the "variance" of your data but includes some noise as well. I think that's the reason one would want to apply PCA to, and analyse, the zero-mean dataset in the end.
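The residual argument can be illustrated numerically. The sketch below keeps only the first eigenvector of the covariance matrix (L = 1) and compares the two reconstructions: projecting the original data directly, versus projecting the centered data and adding the mean back. The data, the helper names and the power-iteration eigen solver are all assumptions made for illustration.

```python
import math

# Made-up data: large mean near (10, 10), small spread around it.
data = [[9.0, 11.5], [10.0, 10.0], [11.0, 9.0], [8.0, 12.0], [12.0, 7.5]]
n, p = len(data), len(data[0])

mean = [sum(row[j] for row in data) / n for j in range(p)]
centered = [[v - m for v, m in zip(row, mean)] for row in data]

# Sample covariance matrix; identical whether or not data is pre-centered.
cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
       for i in range(p)]


def leading_eigenvector(a, iterations=200):
    """Dominant eigenvector of a symmetric matrix by power iteration."""
    v = [1.0] + [0.0] * (len(a) - 1)
    for _ in range(iterations):
        w = [sum(aij * vj for aij, vj in zip(row, v)) for row in a]
        norm = math.sqrt(sum(wi * wi for wi in w))
        v = [wi / norm for wi in w]
    return v


def project(row, unit_vector):
    """Rank-1 reconstruction row * p1 * p1' for a unit vector p1."""
    score = sum(x * pi for x, pi in zip(row, unit_vector))
    return [score * pi for pi in unit_vector]


def rss(original, reconstructed):
    """Residual sum of squares between two row-major matrices."""
    return sum((a - b) ** 2
               for ro, rr in zip(original, reconstructed)
               for a, b in zip(ro, rr))


p1 = leading_eigenvector(cov)

# Uncentered: D' = D * PL * PL'
recon_uncentered = [project(row, p1) for row in data]
# Centered: D' = (D - M) * PL * PL' + M
recon_centered = [[r + m for r, m in zip(project(row, p1), mean)]
                  for row in centered]

rss_uncentered = rss(data, recon_uncentered)
rss_centered = rss(data, recon_centered)

print("residual sum of squares, uncentered:", rss_uncentered)
print("residual sum of squares, centered:  ", rss_centered)
```

On this toy data the uncentered residual is several orders of magnitude larger, because it also contains the part of M that the first eigenvector cannot represent. The centered residual reflects only the variation left unexplained by the first component.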
Posted: April 10th, 2014, 12:01 pm
by acastaldo
very good. thanks.
Posted: April 10th, 2014, 1:47 pm
by Dantas
I thought whether you center the data or not depends on whether the data is stationary.
Posted: April 11th, 2014, 9:34 am
by chewwy
Yeah, it's just to (attempt to) strip out the drift/time inhomogeneity.
Otherwise my first component doesn't really tell me anything interesting.