 
User avatar
pcerutti
Topic Author
Posts: 1
Joined: July 14th, 2002, 3:00 am
Location: Milano (Italy)

Average of Statistics and Principal Component Analysis

September 25th, 2014, 7:40 am

Dear All,

I have a very big class of equity returns (about 500 time series, each of 500 daily observations) from which I infer some statistics to detect outliers (sample mean, sample standard deviation, quartiles, median absolute deviation and so on). Once I have calculated the above statistics for each time series, I need to aggregate them into a single measure to detect outliers valid for the whole class.

Question 1: I have 500 daily means, daily standard deviations, quartiles (Q1, Q3 and IQR) for interquartile analysis, and median absolute deviations (MAD). I can easily calculate the average of the 500 means and the average of the standard deviations using Average Std = Sqrt(Average of Variances), but I don't know how to aggregate the other statistics (Q1, Q3, IQR, MAD) in a statistically consistent way. A simple mean of 500 of the above statistics does not seem statistically correct. Any formulas or ideas to solve this problem?

Question 2: In case I use Principal Component Analysis to reduce the dimension of my asset class, is it better to estimate the above statistics directly from the first n principal components (for example, PC1 to PC10), or is it better to estimate them from the principal component approximation of the original returns, calculated as in (II.2.5) from the book "Market Risk Analysis vol. 2" by Carol Alexander? Again, how can I aggregate the ten statistics I get (Q1, Q3, IQR and MAD for 10 time series) into an average Q1, an average Q3, an average IQR and an average MAD? In case this average aggregation does not make any statistical sense, can I calculate my statistics just on the first principal component PC1, if it explains enough variation in my original returns (so as to avoid any aggregation problem)? In general, is any statistical analysis about the dispersion of my original returns better made on a given number of first principal components, or on the principal component approximation of the original returns?

Thank you very much for replying to all my questions.

Pier
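A minimal sketch of the per-series statistics described above, assuming the 500 x 500 returns sit in a hypothetical NumPy array `returns` (one row per series, one column per day); the pooled standard deviation follows the Sqrt(Average of Variances) identity quoted in the question, while how to pool the quantile-based statistics is exactly the open point:

```python
import numpy as np

# Placeholder data standing in for 500 series x 500 daily observations.
rng = np.random.default_rng(0)
returns = rng.standard_normal((500, 500)) * 0.01

means = returns.mean(axis=1)                        # 500 sample means
variances = returns.var(axis=1, ddof=1)             # 500 sample variances
q1 = np.percentile(returns, 25, axis=1)             # 500 first quartiles
q3 = np.percentile(returns, 75, axis=1)             # 500 third quartiles
iqr = q3 - q1
mad = np.median(np.abs(returns - np.median(returns, axis=1, keepdims=True)), axis=1)

avg_mean = means.mean()
avg_std = np.sqrt(variances.mean())                 # "Average Std = Sqrt(Average of Variances)"

print(avg_mean, avg_std)
```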
 
User avatar
pakhijain19
Posts: 0
Joined: August 11th, 2014, 10:04 am

Average of Statistics and Principal Component Analysis

September 30th, 2014, 6:59 am

a method of analysis which involves finding the linear combination of a set of variables that has maximum variance and removing its effect, repeating this successively.
 
User avatar
pcerutti
Topic Author
Posts: 1
Joined: July 14th, 2002, 3:00 am
Location: Milano (Italy)

Average of Statistics and Principal Component Analysis

September 30th, 2014, 7:03 am

Any paper about this method? Do you wish to reply to all my questions one by one in more depth? Thank you very much.
 
User avatar
daveangel
Posts: 5
Joined: October 20th, 2003, 4:05 pm

Average of Statistics and Principal Component Analysis

September 30th, 2014, 7:09 am

Quote (Originally posted by pcerutti): "Any paper about this method? Do you wish to reply to all my questions one by one in more depth? Thank you very much."

Or just quote from the Oxford Dictionaries.... pakhijain19 is a spammer. Admin - are you going to take any action?
Last edited by daveangel on September 29th, 2014, 10:00 pm, edited 1 time in total.
knowledge comes, wisdom lingers
 
User avatar
neuroguy
Posts: 0
Joined: February 22nd, 2011, 4:07 pm

Average of Statistics and Principal Component Analysis

October 22nd, 2014, 6:40 am

1) This depends slightly on why you are looking for outliers and what you are looking for outliers relative to. Assuming that you are looking for outliers in return statistics, where by "outlier" you mean a return statistic relative to some population, then the standard approach is to take a Z-score: (datapoint - population mean) / population standard deviation. This is a specific example of the class of t-statistics that aim to regularise estimators to permit comparison between them.

In the specific case of stocks, however, one normally performs stratification first. For example, you might look at the Z-score of Ford relative to global (or even just American) auto manufacturers. The reason for this is to remove systemic factors from your comparison, which could dominate it but which you may not actually be interested in. For example, you might find that several Aussie mining stocks end up looking like global outliers, but this might be caused by, say, the dynamics of the USD/AUD cross, which is probably not actually an active bet you want in your portfolio. If you don't stratify, naive outlier analysis might make you think you are long/short mining when actually you are long/short a currency pair (this is an overly simple example but serves to illustrate the point). [For discussion of this see Quantitative Equity Portfolio Management, Chincarini & Kim, McGraw-Hill.]

For MAD specifically you might be able to construct a Z-score analogue (since MAD is just a median-based dispersion measure) along the lines of (datapoint - population median) / MAD.

AFAIK, fractiles (e.g. quartiles) are not really used for outlier analysis. They are normally used to impose an ordering on some datapoints once those datapoints have already been regularised: e.g. you would fractile stock returns after you have normalised them (e.g. with a Z-score) and excluded any erroneous/extreme data points. A sketch of the classical and MAD-based scores is below.
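A short sketch (my own illustration, not from any reference above) of the classical and MAD-based Z-scores, with an optional stratified version; `groups` is a hypothetical array of sector labels:

```python
import numpy as np

def z_scores(x):
    """Classical Z-score: (x - mean) / standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std(ddof=1)

def robust_z_scores(x):
    """MAD-based analogue: (x - median) / MAD. The 1.4826 factor rescales
    the MAD so it estimates the standard deviation under normality."""
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    return (x - med) / (1.4826 * mad)

def stratified_z(x, groups, robust=False):
    """Z-scores computed within each stratum (e.g. sector) rather than
    against the whole population."""
    x = np.asarray(x, dtype=float)
    groups = np.asarray(groups)
    out = np.empty_like(x)
    score = robust_z_scores if robust else z_scores
    for g in np.unique(groups):
        mask = groups == g
        out[mask] = score(x[mask])
    return out
```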
2) You can apply any statistics you want to a return series that is constructed from the principal components. Remember, PCA is just a projection into a basis that is aligned with the first n directions of maximal variance. So you can write any PCA return series r_pca in terms of the original r_0 as:

r_pca = beta_p1*p1 + beta_p2*p2 + ... + beta_pn*pn

where p1 is the first principal component of the population returns, beta_p1 = cor(r_0, p1), and so on; in the above I have neglected the 'alpha' and specific innovation terms for brevity. Now you can compute any statistics you want from the r_pca representation.

Note that this method is statistically 'correct' if your data is some multidimensional Gaussian cloud. In so far as your data is not like this (which it won't be), there is a loss of representation of any higher-order statistics. For example, you can always find a PCA for a cloud of data, but PCA does not guarantee that the data is not covariant within the PCA basis. Specifically for finance, this can mean there is a feeling that PCA allows you to invest in 'separate' components of risk, but really you are investing in 'separate' components only after applying the implicit assumption that those components are orthogonal in the PCA basis. Hence there can be unintended risk due to the unrecognised covariance of data even within the PCA basis (typically due to non-linearities or 'dislocations').

You can gain an understanding of how many components to use by looking at the eigenvalues of the covariance matrix (since PCA is merely constructed from the orthonormal basis implied by this matrix). The eigenvalues will 'decay', and one method to choose the number of components is to examine this spectrum (see the sketch below). Since PCA is explicitly throwing away information in the original data, you should perform dispersion analysis on the original data (not any PCA representation of it). Usually you would only build such a 'statistical' model once you are happy that you understand the statistics (at least historically) and that it is at least a sensible tool to apply.
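And a sketch of the PCA side under the same caveats: eigen-decomposition of the covariance matrix, the decaying eigenvalue spectrum, and the rank-n reconstruction r_pca. Here `returns` is assumed to be a T x N array of daily returns (T days, N stocks):

```python
import numpy as np

def pca_reconstruction(returns, n_components):
    """Return the rank-n PCA approximation of `returns` and the
    fraction of variance explained by each component."""
    X = returns - returns.mean(axis=0)           # demean each series
    cov = np.cov(X, rowvar=False)                # N x N covariance matrix
    eigval, eigvec = np.linalg.eigh(cov)         # eigenvalues in ascending order
    order = np.argsort(eigval)[::-1]             # sort descending
    eigval, eigvec = eigval[order], eigvec[:, order]

    explained = eigval / eigval.sum()            # the 'decaying' spectrum
    W = eigvec[:, :n_components]                 # loadings of the first n PCs
    scores = X @ W                               # principal component series
    X_pca = scores @ W.T + returns.mean(axis=0)  # rank-n approximation of returns
    return X_pca, explained
```

Plotting `explained` is one way to examine the spectrum when choosing n; per the point above, dispersion analysis would still be run on the original returns rather than on the PCA approximation.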
 
User avatar
pcerutti
Topic Author
Posts: 1
Joined: July 14th, 2002, 3:00 am
Location: Milano (Italy)

Average of Statistics and Principal Component Analysis

October 22nd, 2014, 8:51 am

Thank you very much Neuroguy.
 
User avatar
MBSADVISORS
Posts: 3
Joined: May 21st, 2012, 11:16 am

Average of Statistics and Principal Component Analysis

November 10th, 2014, 5:47 pm

5.8% Unemployment Rate: A Sign of Better Economic Times, or a Greater Fool's Paradise?

Last week's unemployment rate is a perfect example of the shortcomings of most risk models in use today, which are heavily dependent on statistical data of the past. If we were to take the most recent unemployment report at face value, one would be tempted to believe we were on the verge of being teleported back to the days when "the Greater Fool Theory" thrived, Monte Carlo risk simulation models could do no wrong, and hourly wage / bonus increases had no limits. So why doesn't it feel that way, despite all the reported economic hype and exuberance surrounding the stock market after last week's unemployment report?

In an effort to avoid prolonged debates regarding the accuracy of the methodology behind the most recent unemployment report, we may wish to approach the subject from a different angle. As is widely known, a higher Labor Participation Rate (LPR) has a direct effect on the total amount of disposable income in circulation, which fuels the economy's growth. The opposite is true when the LPR numbers begin to recede. At the peak of the last economic expansion the LPR was approximately 66. At present the LPR is at a 30-year low of 62.8: http://data.bls.gov/timeseries/LNS11300000. This would seem to explain why the underpinnings of the present economy's growth don't feel nearly as strong as they did the last time the unemployment rate was below 5.9%.

A troubling note: since the third quarter of 2013 the HCTI ratio has increased from 5.3 to 5.9, which has had a negative effect on the LPR. As a consequence, the LPR may continue to be flat for the remainder of the year. The Fed's departure from the bond market will not only lead to higher mortgage rates; it may also lead to a higher HCTI if hourly wages do not increase in tandem.

Our new risk model, formally known as The Economic Genome (EG), takes certain behavioral input values into account, focusing on future risk as opposed to reporting past data inputs, which have little to no value. View the EG model and methodology here: https://economicgenome.blogspot.com. We are seeking input from the quant community with regard to the benefits of implementing the EG model within their own models.
 
User avatar
pcerutti
Topic Author
Posts: 1
Joined: July 14th, 2002, 3:00 am
Location: Milano (Italy)

Average of Statistics and Principal Component Analysis

November 12th, 2014, 1:24 pm

???????????????????????????
 
User avatar
neuroguy
Posts: 0
Joined: February 22nd, 2011, 4:07 pm

Average of Statistics and Principal Component Analysis

November 13th, 2014, 6:11 am

Quote (Originally posted by MBSADVISORS): "5.8% Unemployment Rate: A Sign of Better Economic Times, or a Greater Fool's Paradise? ... We are seeking input from the quant community with regard to the benefits of implementing the EG model within their own models."

In so far as you care... don't spam threads. And rather than write a half-baked blog post, why not actually write a paper? You will get more mileage.

From your blog: "The goal being, to create an algorithm, which will achieve a sustained stabilized economy as experienced in the mid-1940s throughout the 1950s"

What makes you think this is possible? The above statement assumes a kind of statistical closure that does not exist in the real world. The best you can hope for is an economic model that is efficient given the information at any given time. The paradox here is that such an economy will actually, at times, be more violent than an inefficient one, not less. This is because an efficient economy will respond very rapidly to the (constant) unpredictable events of history. Inefficient economies tend to just go into a protracted heat-death spiral.
 
User avatar
MHill
Posts: 21
Joined: February 26th, 2010, 11:32 pm

Average of Statistics and Principal Component Analysis

November 13th, 2014, 12:14 pm

Quote: "From your blog: 'The goal being, to create an algorithm, which will achieve a sustained stabilized economy as experienced in the mid-1940s throughout the 1950s'"

Was the economy of the mid 40s & 50s not characterised by ration cards and the nationalisation of industries? I'm sure that's what my mum told me.

Edit: Oh yeah - and cheap loans from America.
Last edited by MHill on November 12th, 2014, 11:00 pm, edited 1 time in total.
 
User avatar
Cuchulainn
Posts: 22929
Joined: July 16th, 2004, 7:38 am

Average of Statistics and Principal Component Analysis

November 13th, 2014, 1:12 pm

Quote (Originally posted by MHill): "Was the economy of the mid 40s & 50s not characterised by ration cards and the nationalisation of industries? I'm sure that's what my mum told me. Edit: Oh yeah - and cheap loans from America"

Indeed, we had a ration of tea and sugar during the Emergency, 1940-1945.
Last edited by Cuchulainn on November 12th, 2014, 11:00 pm, edited 1 time in total.