(incremental/parallel) statistics
Posted: February 1st, 2012, 9:29 pm
Dear QFCLers,

I am opening a whole new thread for discussing statistics, as there's enough to decide here already. [edit: sorry, I didn't notice the other one in time; can we merge them and delete this thread altogether?]

We have different challenges:
- providing numerically good estimators
- getting them to run incrementally? (please comment on this especially; a few algorithms require two passes, while in some cases plain processing of the whole sample set is already fine)
- performing estimates in a parallel setting
- getting everything fast
- new algos should ideally also provide a weighted form
- [integrate with boost accumulators]
- [stay SIMD & GPU friendly]
- [beware of QMC]
- do we want to also add algos for trading? e.g. moving averages (in boost) & other windowed statistics, EWMA and so on... (I once saw an open-source lib for this, but it was taken offline). I think it would be great to expand the scope beyond just pricing & risk... but it might complicate the design.

What accumulators would you like to see? Outrun has nice ones, and I have a few ideas for others.

Parallel accumulators: here there are many different approaches, leading to different designs:
- ex post merging of n independent estimates (low accuracy, might also lead to bias); easiest to implement
- ex post merging of accumulator working data, with the final estimate computed on the merged data
- inherently parallel algorithms (such as enhanced GK)
- collection of all samples from the different workers, and subsequent application of standard quantile algorithms (possibly with a parallel implementation)

Clearly, the approaches with the most potential (for speed and accuracy) are also the most involved. I find it difficult to choose an approach (and thus the corresponding design) a priori, before having done a thorough comparison. Ideally, then, it would be great to devise a framework flexible enough to cover all options.

Are any of you familiar with the accumulator framework in Boost?
I wonder whether we should extend it or write independent code. The documentation is not so clear about numerical accuracy; I shall write some tests, e.g. for the rolling sum...

For the case of quantiles and conditional expectations, boost offers:
- global ICDF approximation via the P^2 algorithm
- tail quantile (what's the complexity?)
- tail conditional expectation
- peak over threshold (from extreme value theory)
- peak over threshold conditional expectation
(what are tail_variate & co.?)

There are weighted versions already, but parallelization might not be immediate. Also, the quality could be improved, and some accuracy tests would be nice. Most importantly, the benchmark case of (accurate) methods working on the whole sample set is missing.
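To make the "ex post merging of accumulator working data" option a bit more concrete, here is a minimal sketch of a weighted, single-pass mean/variance accumulator whose state can be merged across workers. It is only an illustration under my own assumptions: the name MomentAccumulator and its members are hypothetical and not part of Boost.Accumulators or any existing library; the update follows the usual West-style weighted recurrence and the merge follows the pairwise formula of Chan et al.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical sketch: NOT an existing Boost or QuantLib class.
// Keeps three numbers of working data, so merging is cheap.
struct MomentAccumulator {
    double w_sum = 0.0;  // sum of weights seen so far
    double mean  = 0.0;  // running weighted mean
    double m2    = 0.0;  // running sum of w_i * (x_i - mean)^2

    // Incremental (one-pass) weighted update.
    void add(double x, double w = 1.0) {
        assert(w >= 0.0);          // negative weights are not supported here
        if (w == 0.0) return;      // zero-weight samples carry no information
        const double new_w = w_sum + w;
        const double delta = x - mean;
        mean += delta * (w / new_w);
        m2   += w * delta * (x - mean);  // uses old and updated mean
        w_sum = new_w;
    }

    // Merge the working data of another worker into this one.
    void merge(const MomentAccumulator& o) {
        if (o.w_sum == 0.0) return;
        if (w_sum == 0.0) { *this = o; return; }
        const double total = w_sum + o.w_sum;
        const double delta = o.mean - mean;
        m2   += o.m2 + delta * delta * (w_sum * o.w_sum / total);
        mean += delta * (o.w_sum / total);
        w_sum = total;
    }

    // Weighted population variance (no small-sample bias correction).
    double variance() const { return w_sum > 0.0 ? m2 / w_sum : 0.0; }
};
```

Each worker would fill its own MomentAccumulator via add(), and the results would be combined with merge() at the end; the final estimate is then computed on the merged working data rather than by averaging per-worker estimates. Quantile estimators do not reduce to a fixed-size state this easily, which is where the design question gets harder.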