How would you detect outliers in an online fashion for variance estimation? Simple and practical methods please, like corrections to a z-score test, so as to cover semi-heavy tails and heteroscedasticity/exponential weighting. And how does one derive a consistent fresh warm-up/initialization to get the at-regime behaviour of a longer-dated online run? You can't just pad your data with zeros when it's dispersion you're estimating. Yes, sample variance is not our preferred choice; if you can do other [robust] dispersions online, that would be nice to hear too.

- Traden4Alpha
**Posts:** 23951

By "online," I assume you want something that can judge P(outlier) for the most current value of a time series. It also seems you want outlier detection when the distribution is non-Gaussian. How about:

1. Rank (i.e., what is the rank of the current value WRT some representative historical dataset) is totally non-parametric and great for arbitrary (but IID) data. But it won't work well if volatility varies over time (it will have lots of false positives during times of high vol and miss outliers during times of low vol).

2. Rank of the short-term Z-score is more robust to heteroscedasticity, but then you get into the game of tuning the duration of the estimators for mean and variance for the Z value. Personally, I love exponential moving estimators for mean and variance because they forget smoothly, converge relatively quickly to any step-change in the true value of the estimated variable, and are lightweight in both memory and CPU. Of course, anything with rank has the computational issue of holding a window of data. And if your data has heavy tails, then you want a window big enough to encompass a statistically meaningful number of tail events. YUCK!

3. Track E(Z > x) for a series of x values such as {1, 2, 3, 4, 5}. E(Z > x) is an estimator for the CDF of the positive tail (at a discrete set of points) which you can use to estimate P(outlier) in that direction. You'll also want an E(Z < x) with x < 0 to empirically model the negative tail, or use E(|Z| > x) if your distribution is symmetric. Doing this "right" can be hard because the window size (or duration of the exponential moving average) needs to be tuned to the frequency of events.

The warm-up is not really possible unless one has some nice parametric model for the distribution (and the parameters of that model stabilize at low sample sizes). By definition, tail events are rare, which implies that characterizing them can take a lot of samples.
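Point 2's exponential moving estimators feeding a Z-score test can be sketched in a few lines. This is a minimal illustration, not code from the thread; `alpha`, `z_crit`, and the winsorized update are illustrative choices of mine:

```python
import math
import random

class EWOutlierDetector:
    """Exponentially weighted mean/variance with a Z-score outlier test.

    O(1) time and memory per observation. The raw accumulators are
    debiased by w_t = 1 - (1 - alpha)^t so the warm-up is not dragged
    toward the zero initialization.
    """

    def __init__(self, alpha=0.05, z_crit=4.0):
        self.alpha = alpha      # decay rate (illustrative choice)
        self.z_crit = z_crit    # outlier cutoff in sigmas (illustrative)
        self.m = 0.0            # raw EW mean accumulator
        self.s = 0.0            # raw EW second-central-moment accumulator
        self.w = 0.0            # debiasing weight, 1 - (1 - alpha)^t

    def update(self, x):
        """Return (is_outlier, z) for x, then fold x into the estimators.

        Flagged values are winsorized at z_crit before updating, so a
        single shock cannot blow up the variance estimate.
        """
        a = self.alpha
        if self.w > 0.0 and self.s > 0.0:
            mean = self.m / self.w
            sd = math.sqrt(self.s / self.w)
            z = (x - mean) / sd
        else:
            mean, sd, z = x, 0.0, 0.0   # warm-up: nothing to test against
        is_outlier = abs(z) > self.z_crit
        if is_outlier:                  # clip the shock before updating
            x = mean + math.copysign(self.z_crit, z) * sd
        self.w = (1.0 - a) * self.w + a
        self.m = (1.0 - a) * self.m + a * x
        d = x - self.m / self.w
        self.s = (1.0 - a) * self.s + a * d * d
        return is_outlier, z

# demo: feed Gaussian noise, then a large shock
random.seed(0)
det = EWOutlierDetector(alpha=0.05, z_crit=4.0)
for _ in range(500):
    det.update(random.gauss(0.0, 1.0))
flag, z = det.update(10.0)   # a roughly 10-sigma shock
```

The winsorized update is one way to keep the estimator itself robust; dropping flagged samples entirely is the other obvious choice, at the cost of a downward bias in the variance.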
If you have independent data streams that all have the same semi-heavy tail distribution (perhaps with the exception of some trivial location or scale parameter), then you could use normalized cross-sectional data to more quickly build an empirical model of the tails. P.S. If the tails of your distribution follow some curve that is linear in some transformed variable space (e.g., the log-log transform for power law tails), then you can create an online curve-fit for the tails.

Last edited by Traden4Alpha on February 17th, 2014, 11:00 pm, edited 1 time in total.
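The P.S. about an online curve-fit for tails that are linear in some transformed space can be sketched as an exponentially weighted least-squares fit; feeding it (log x, log of the empirical P(X > x)) pairs estimates a power-law tail exponent from the slope. A sketch under those assumptions (class and parameter names are mine, not from the thread):

```python
class EWLinearFit:
    """Exponentially weighted least-squares fit of y = a + b*u, O(1) per point.

    For power-law tails, feed u = log(x) and y = log(empirical P(X > x));
    the slope b then estimates the (negative) tail exponent.
    """

    def __init__(self, alpha=0.05):
        self.alpha = alpha
        self.su = self.sy = self.suu = self.suy = 0.0
        self.w = 0.0   # debiasing weight, as in any EW accumulator

    def update(self, u, y):
        a = self.alpha
        self.w = (1.0 - a) * self.w + a
        self.su = (1.0 - a) * self.su + a * u
        self.sy = (1.0 - a) * self.sy + a * y
        self.suu = (1.0 - a) * self.suu + a * u * u
        self.suy = (1.0 - a) * self.suy + a * u * y

    def slope(self):
        mu, my = self.su / self.w, self.sy / self.w
        cov = self.suy / self.w - mu * my
        var = self.suu / self.w - mu * mu
        return cov / var

# demo: exactly collinear points recover the slope regardless of weights
fit = EWLinearFit(alpha=0.1)
for i in range(1, 200):
    u = float(i % 10) + 1.0
    fit.update(u, 2.0 - 1.5 * u)
```

This leaves open the question raised later in the thread of where the tail begins; the fit only makes sense for the x values you decide are already "in" the tail.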

Quote: "By 'online,' I assume you want something that can judge P(outlier) for the most current value of a time series. It also seems you want outlier detection when the distribution is non-Gaussian."

Thanks for the tips. Ideally, by "online" I also mean doing it in O(1) time and space per input value; e.g., we might need different volatilities for each covariance, so anything more would blow up... This rules out rolling-window ranks, but rank against a "fixed" shared dataset would work fine... oh well, that's more or less the same as maintaining a cutoff value in the end... but unfortunately we have SV/decay.

We also considered 3., which is somewhat dual to tracking a sorted list, but for the tails it's problematic, and again it has problems with SV unless your subdivision is fine enough to be impractical... exponential decay would be O(m) (with m thresholds), which would be expensive. There are also online quantile-tracking algos, but they're not immediate to adapt to exponential weighting in an efficient way.

(In parallel I'm looking at alternatives that avoid outlier filtering altogether, via heavy-tail-robust estimators, but we need a quick and dirty solution in the meantime.)

Quote: "P.S. If the tails of your distribution follow some curve that is linear in some transformed variable space (e.g., the log-log transform for power law tails), then you can create an online curve-fit for the tails."

Yep, that would be nice, but how does one detect (and maintain) the beginning of the tail? And it all needs to be online... Oh well, if you guys don't have a solution at hand for this, I'd also be curious to know how you can live without one. All batch anew each time?
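On the online-quantile-with-decay point: one O(1) workaround (my suggestion, not something endorsed in the thread) is a constant-step stochastic-approximation update. A constant step never stops adapting, which acts like exponential forgetting, and normalizing the step by an EW scale estimate copes with SV. A rough sketch with illustrative parameter choices:

```python
import random

class SAQuantile:
    """Track the tau-quantile online in O(1) time and memory.

    Update rule: q <- q + eta * s * (tau - 1{x <= q}), where s is an EW
    mean-absolute-deviation that keeps the step on the data's scale.
    The constant step eta plays the role of the exponential-decay rate:
    it trades tracking speed against stationary jitter.
    """

    def __init__(self, tau, eta=0.05, alpha=0.05, q0=0.0):
        self.tau = tau
        self.eta = eta          # constant SA step => exponential forgetting
        self.alpha = alpha      # decay for the location/scale estimators
        self.q = q0
        self.mean = 0.0
        self.scale = 1.0

    def update(self, x):
        # EW location and scale, used only to normalize the step size
        self.mean += self.alpha * (x - self.mean)
        self.scale += self.alpha * (abs(x - self.mean) - self.scale)
        step = self.eta * max(self.scale, 1e-12)
        self.q += step * (self.tau - (1.0 if x <= self.q else 0.0))
        return self.q

# demo: track the 90% quantile of N(0,1) draws (true value is ~1.28)
random.seed(1)
tr = SAQuantile(tau=0.9)
for _ in range(20000):
    tr.update(random.gauss(0.0, 1.0))
```

The stationary noise of the tracked quantile scales roughly with the square root of the step, so a smaller eta gives a smoother but slower tracker; running one instance per threshold of interest keeps everything O(1) per stream.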

- Traden4Alpha

Hmmmm... I do have a question... Are you trying to model a system that is a mix of a time-varying-SV Gaussian with added "outlier" tail events?

If so, then instead of modeling the tails, you might consider modeling the body of the empirical distribution using a series of trimmed means. If you track something like {E(-0.5 < Z < 0.5), E(-1 < Z < 1), E(-1.5 < Z < 1.5)} and use that to re-estimate SV (and modulate future Z calculations), you'll have a fairly robust predictor-corrector for SV. Admittedly, this estimator is biased unless you also correct for the reduced arrival rate of body samples due to outliers (something like SV_true = SV_bodysample*(1 - P(outlier))), which you can probably estimate by modeling the slope or shape of the empirical tail versus the expected slope or shape of the tail estimated from the body samples under the assumption of normality. The body-sample estimator should be relatively fast, but any modeling of the tails is going to be very slow due to the natural rarity of these events. Of course, if the rate of outliers varies in time, it's going to be messy!
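The body-sample idea can be sketched with an exponentially weighted variant: update the variance only from samples with |Z| <= c, then rescale by the truncated-normal second moment E[Z^2 | |Z| <= c] = 1 - 2*c*phi(c)/(2*Phi(c) - 1), so the estimate is unbiased when the body really is Gaussian. This is a sketch under that assumption (alpha and c are illustrative, and the correction conditions on being in the body rather than using the 1 - P(outlier) arrival-rate correction described in the post):

```python
import math
import random

def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def trunc_var_factor(c):
    """E[Z^2 | |Z| <= c] for Z ~ N(0,1)."""
    return 1.0 - 2.0 * c * normal_pdf(c) / (2.0 * normal_cdf(c) - 1.0)

class TrimmedEWVariance:
    """EW variance estimated from body samples (|Z| <= c) only.

    Tail samples are ignored entirely; the truncated-normal factor
    undoes the downward bias this introduces when the body is Gaussian.
    O(1) per observation.
    """

    def __init__(self, alpha=0.02, c=2.0, var0=1.0):
        self.alpha = alpha
        self.c = c
        self.factor = trunc_var_factor(c)   # ~0.774 for c = 2
        self.mean = 0.0
        self.var = var0                      # corrected variance estimate

    def update(self, x):
        z = (x - self.mean) / math.sqrt(self.var)
        if abs(z) <= self.c:                 # body sample: fold it in
            self.mean += self.alpha * (x - self.mean)
            d2 = (x - self.mean) ** 2
            self.var += self.alpha * (d2 / self.factor - self.var)
        return self.var                      # tail samples leave it untouched

# demo: N(0, 2^2) data contaminated with occasional huge shocks
random.seed(2)
tev = TrimmedEWVariance(alpha=0.02, c=2.0, var0=1.0)
for i in range(20000):
    x = 50.0 if i % 100 == 99 else random.gauss(0.0, 2.0)
    tev.update(x)
```

The self-referential trim (the threshold uses the current corrected SV) is what makes it a predictor-corrector; the shocks at 50 never enter the estimate, so it settles near the clean variance of 4 rather than being inflated by them.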

I have had some success using the ratio of the Cornish-Fisher modified downside probability to the normal probability of loss as a deviation-from-normality indicator. It is my belief that tail deviations don't all happen instantaneously and that you often get sufficient precursor movement, probably as a result of crowding and bubble formation, to identify left-tail events before they occur. The regimes are obviously time-varying, so it is difficult to find a set of universal calibrations (loss threshold and lookback period) that works for all periods, so it is still a work in progress.

My out-of-sample backtest results so far: on monthly data for the S&P 500 with loss threshold = 0%, trigger deviation from prior value = 10%, lookback = 36 months, I get a CAGR of +10.39% versus the S&P 500's +5.21% (Jan 04 - Dec 13). It takes you out of the market for 18 months, thereby reducing the 2008 drawdown to -13.69% versus -38.49% for ^GSPC.

On daily data for the DJIA with loss threshold = 0%, trigger deviation from prior value = 23%, lookback = 50 days, I get a CAGR of +5.24% versus the DJIA's +4.92% from Dec 1928 - Dec 2013. It takes you out of the market for 376 days. However, that particular calibration doesn't work for the 2008 crisis because the ratio was only around 1.07 for most of 2008, which is lower than the (1 + 23%) trigger. I probably need to introduce a second layer of deviation from an expanding-window mean to accommodate this.

All calcs so far are at the 50th percentile; I may well get better results using the 95th or 99th percentile and using VaR or CVaR (Expected Shortfall). Returns are assumed = 0% when out of the market.
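For reference, a deviation indicator along these lines can be sketched as the ratio of a Cornish-Fisher-implied loss probability to the plain Gaussian one. The fourth-order CF expansion and the bisection inversion below are my own minimal reading of the post, not the poster's actual code; note the CF map is only monotone (hence safely invertible) for moderate skew and excess kurtosis:

```python
import math

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def cornish_fisher(z, skew, exkurt):
    """Fourth-order Cornish-Fisher adjusted standard quantile."""
    return (z
            + (z**2 - 1.0) * skew / 6.0
            + (z**3 - 3.0 * z) * exkurt / 24.0
            - (2.0 * z**3 - 5.0 * z) * skew**2 / 36.0)

def cf_loss_prob(threshold, mu, sigma, skew, exkurt):
    """P(X < threshold) under the CF-adjusted distribution.

    Found by numerically inverting the CF quantile map with bisection,
    which assumes the map is monotone on [-8, 8] (true for moderate
    skew/exkurt).
    """
    target = (threshold - mu) / sigma
    lo, hi = -8.0, 8.0
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if cornish_fisher(mid, skew, exkurt) < target:
            lo = mid
        else:
            hi = mid
    return normal_cdf(0.5 * (lo + hi))

def deviation_ratio(threshold, mu, sigma, skew, exkurt):
    """Ratio of CF downside probability to the plain normal probability:
    > 1 indicates fatter-than-normal downside at this threshold."""
    return (cf_loss_prob(threshold, mu, sigma, skew, exkurt)
            / normal_cdf((threshold - mu) / sigma))
```

With zero skew and zero excess kurtosis the ratio is exactly 1; negative skew and excess kurtosis push the downside ratio above 1, which is the "deviation from normality" the trigger watches for.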
