March 29th, 2013, 7:37 pm
Hi Giobilkis.

Sorry for not getting your point before.

Most of the tutorials on PCA I have come across tell you to subtract the mean without really explaining why. E.g.:

Principal Components Analysis Using R - P1 (2'50")
A tutorial on Principal Components Analysis (p.12)

But there is a nice reply at mathworks:

Quote:
Well, in theory, you don't absolutely need to center the data.

In theory, you should also always center the data.

Do these two statements seem to conflict? (Yes.) I'll explain below.

First of all, how do you center the data? By subtracting off the mean. Assume that your data is stored in a standard form, an nxp array, where p is the number of dimensions in your data set, and n is the number of data points in the array. Subtract off the column means from each column. Essentially we are making each variable have a mean of zero. We can also think of this as a transformation of the space to have a new origin. This will do the centering...

  data_cent = bsxfun(@minus, data, mean(data,1));

The covariance matrix is given by

  covmatrix = (data_cent'*data_cent)/(n-1);

or, we could have done it as just

  covmatrix = cov(data);

The cov function does the centering for us, but it is valuable to understand what we are doing. We can apply a PCA tool to the centered data. This is as simple as computing the eigenvectors and eigenvalues of the covariance matrix, alternatively, compute the right singular vectors and the square of the singular values of the centered data array, using svd.

Suppose we did not do the centering? What would happen? Well, as it turns out, the world will not end. We will indeed get a result, but it will not be as useful. A PCA applied on what is known as the second moment matrix, i.e., data'*data, will have the mean of the data built into the eigenvectors. (Surprisingly, this is occasionally the correct thing to do!) Mainly, the first eigenvector will look very strange. In fact, the first eigenvector will probably look very much like the mean vector itself. The rest of the vectors will be very different too.

Of course, if you use princomp from the stats toolbox, it appears to automatically center the data for you. No fun at all.

There is the final distinction of whether you do the PCA on a scaled data matrix. If the columns are scaled to have unit standard deviations, then a PCA will be based on the correlation matrix. In effect, this gives each variable the same weight as all others. That can be important if the variables in your data set have very different units and relative scalings.

HTH,
John

I think this is the clarification you are after.
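
P.S. In case a runnable version helps, here is a minimal sketch of the centered vs. uncentered comparison from the quote. It is my own plain-Python translation, not the MATLAB from the mathworks reply: the helper names (column_means, center, cross_product, leading_eigenvector, abs_cosine) and the five 2-D data points are made up for illustration, and I use simple power iteration instead of eig or svd just to keep it dependency-free.

```python
import math

def column_means(rows):
    """Mean of each column of an n-by-p list of rows."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]

def center(rows):
    """Subtract the column means (same idea as bsxfun(@minus, data, mean(data,1)))."""
    means = column_means(rows)
    return [[x - m for x, m in zip(r, means)] for r in rows]

def cross_product(rows):
    """rows' * rows / (n - 1). This is the covariance matrix only if rows are centered."""
    n, p = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(p)]
            for i in range(p)]

def leading_eigenvector(mat, iters=200):
    """Power iteration: unit eigenvector for the largest eigenvalue of a symmetric PSD matrix."""
    p = len(mat)
    v = [1.0 / (k + 1) for k in range(p)]  # arbitrary nonzero starting vector
    for _ in range(iters):
        w = [sum(mat[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def abs_cosine(u, v):
    """Absolute cosine of the angle between two vectors (eigenvector sign is arbitrary)."""
    dot = sum(a * b for a, b in zip(u, v))
    return abs(dot) / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical toy data: mean near (10, 10), spread mostly along the (1, -1) direction.
data = [[8.0, 12.1], [9.0, 10.9], [10.0, 10.0], [11.0, 9.1], [12.0, 7.9]]

mean_vec = column_means(data)
pc1_centered = leading_eigenvector(cross_product(center(data)))  # PCA on the covariance matrix
pc1_raw = leading_eigenvector(cross_product(data))               # PCA on the second moment matrix

print("mean vector:           ", mean_vec)
print("first PC (centered):   ", pc1_centered)
print("first PC (not centered):", pc1_raw)
print("|cos| raw PC vs mean:      ", abs_cosine(pc1_raw, mean_vec))
print("|cos| centered PC vs mean: ", abs_cosine(pc1_centered, mean_vec))
```

On this toy data the uncentered first component should line up almost exactly with the mean vector, while the centered one follows the actual spread of the points, which is the effect John describes. With real data you would of course use cov/eig or svd (or princomp) rather than a hand-rolled power iteration.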
Last edited by
tags on March 28th, 2013, 11:00 pm, edited 1 time in total.