February 18th, 2014, 12:11 am
By "online," I assume you want something that can judge P(outlier) for the most current value of a time series. It also seems you want outlier detection when the distribution is non-Gaussian.

How about:

1. Rank (i.e., what is the rank of the current value WRT some representative historical dataset) is totally non-parametric and great for arbitrary (but IID) data. But it won't work well if volatility varies over time (it will have lots of false positives during times of high vol and miss outliers during times of low vol).

2. Rank of the short-term Z-score is more robust to heteroskedasticity, but then you get into the game of tuning the duration of the estimators for the mean and variance that go into the Z value. Personally, I love exponential moving estimators for mean and variance because they forget smoothly, converge relatively quickly to any step-change in the true value of the estimated variable, and are lightweight in both memory and CPU.

Of course, anything with rank has the computational issue of holding a window of data. And if your data has heavy tails, then you want a window big enough to encompass a statistically meaningful number of tail events. YUCK!

3. Track E(Z > x) for a series of x values such as {1, 2, 3, 4, 5}. E(Z > x), i.e., the average of the indicator that Z exceeds x, is an estimator for the complementary CDF (1 - CDF) of the positive tail at a discrete set of points, which you can use to estimate P(outlier) in that direction. You'll also want an E(Z < x) with x < 0 to empirically model the negative tail, or use E(|Z| > x) if your distribution is symmetric. Doing this "right" can be hard because the window size (or duration of the exponential moving average) needs to be tuned to the frequency of events.

The warm-up is not really possible unless one has some nice parametric model for the distribution (and the parameters of that model stabilize at low sample sizes). By definition, tail events are rare, which implies that characterizing them can take a lot of samples.
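To make approaches 2 and 3 above concrete, here is a minimal sketch, not a tuned implementation. The class name `EwmaTailTracker`, the rates `alpha` and `tail_alpha`, and the `warmup` count are all my own illustrative assumptions; the right values depend entirely on how fast your volatility moves and how often tail events occur.

```python
import math
import random


class EwmaTailTracker:
    """Exponentially weighted Z-score plus running estimates of P(Z > x) and P(Z < -x).

    All parameter names and defaults are illustrative assumptions.
    """

    def __init__(self, alpha=0.05, tail_alpha=0.005,
                 thresholds=(1, 2, 3, 4, 5), warmup=50):
        self.alpha = alpha            # forgetting rate for mean and variance
        self.tail_alpha = tail_alpha  # slower rate, since tail events are rare
        self.thresholds = thresholds
        self.warmup = warmup          # samples to skip before tracking the tails
        self.count = 0
        self.mean = None
        self.var = 0.0
        self.p_upper = {x: 0.0 for x in thresholds}  # estimates of P(Z > x)
        self.p_lower = {x: 0.0 for x in thresholds}  # estimates of P(Z < -x)

    def update(self, value):
        """Score `value` against the state so far, then absorb it. Returns Z."""
        self.count += 1
        if self.mean is None:         # the first sample only seeds the mean
            self.mean = value
            return 0.0
        diff = value - self.mean
        z = diff / math.sqrt(self.var) if self.var > 0 else 0.0
        if self.count > self.warmup:
            # Exponential moving averages of the exceedance indicators.
            for x in self.thresholds:
                self.p_upper[x] += self.tail_alpha * ((z > x) - self.p_upper[x])
                self.p_lower[x] += self.tail_alpha * ((z < -x) - self.p_lower[x])
        # Incremental exponentially weighted mean and variance.
        self.mean += self.alpha * diff
        self.var = (1 - self.alpha) * (self.var + self.alpha * diff * diff)
        return z


random.seed(0)
tracker = EwmaTailTracker()
for _ in range(20000):
    tracker.update(random.gauss(0.0, 1.0))
print(tracker.p_upper)  # rough empirical P(Z > x) at each threshold
```

Note that Z is computed before the new value is folded into the mean and variance, so an outlier cannot mask itself. The state is a handful of floats, with no window of data to hold.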
If you have independent data streams that all have the same semi-heavy-tailed distribution (perhaps with the exception of some trivial location or scale parameter), then you could use normalized cross-sectional data to more quickly build an empirical model of the tails.

P.S. If the tails of your distribution follow some curve that is linear in some transformed variable space (e.g., the log-log transform for power-law tails), then you can create an online curve-fit for the tails.
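One way the P.S. could look in practice, as a sketch only: fit a least-squares line through (log x, log P(Z > x)) at the few thresholds you are already tracking, and re-run the fit whenever the exceedance estimates change. The function names `fit_loglog_tail` and `tail_prob` are made up for illustration, and the input probabilities below are synthetic exact power-law values, not real data.

```python
import math


def fit_loglog_tail(thresholds, probs):
    """Least-squares line through (log x, log p). Returns (intercept, slope)."""
    # Thresholds with a zero estimate carry no information in log space.
    pts = [(math.log(x), math.log(p))
           for x, p in zip(thresholds, probs) if x > 0 and p > 0]
    n = len(pts)
    if n < 2:
        raise ValueError("need at least two thresholds with nonzero estimates")
    mx = sum(u for u, _ in pts) / n
    my = sum(v for _, v in pts) / n
    sxx = sum((u - mx) ** 2 for u, _ in pts)
    sxy = sum((u - mx) * (v - my) for u, v in pts)
    slope = sxy / sxx
    return my - slope * mx, slope


def tail_prob(x, intercept, slope):
    """Extrapolate P(Z > x) from the fitted log-log line."""
    return math.exp(intercept + slope * math.log(x))


# Synthetic check: an exact power-law tail P(Z > x) = x**-3.
xs = (2, 3, 4, 5)
ps = [x ** -3.0 for x in xs]
intercept, slope = fit_loglog_tail(xs, ps)
print(slope, tail_prob(10.0, intercept, slope))
```

Since the fit only touches a few numbers, refitting on every tick costs next to nothing. Whether the extrapolation beyond the largest threshold is trustworthy depends on the tail really being linear in log-log space.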
Last edited by
Traden4Alpha on February 17th, 2014, 11:00 pm, edited 1 time in total.