Hey Outrun,

Good, so for a first step I would suggest:

- "parallel" reduction of two min/max/mean etc. accumulator states into one
- "parallel" reduction of two P^2 quantile accumulator states
- "parallel" reduction of two tail accumulator states
- a global framework for using the above pairwise accumulator reductions to reduce more states (either sequentially or in a tree-like fashion)
- a collector of all samples from all threads into a single sample set, for offline and serial statistics, as a benchmark for quality and unit testing (or the other way round: generate samples sequentially, split them across many threads, accumulate in parallel, reduce, and compare against the sequential statistics of the whole original sample; which approach do you prefer?)
- performance testing
- other quantile algorithms, e.g. those from the papers you posted, and maybe the extended GK algorithm too
- CUDA versions of what exactly, and in what context? What is the framework? Some reductions are implemented in Thrust, but is that the right level for orchestrating it all? How do we handle multi-GPU, or even just multicore + GPU?
- and what about OpenCL? Why prefer something proprietary and less stable, other than for performance and C++-like syntax?

Once this basic stuff is ready we can think about more complicated algorithms.

Who volunteers for what? I'd like the quantile stuff...

For algo trading, remember that the circular_buffer<T, Alloc> container is now available. But here maybe someone from the field should comment, and perhaps post a wishlist.
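To make the first bullet concrete, here is a minimal sketch of reducing two min/max/mean accumulator states into one. The `Acc` struct and its field names are purely illustrative (not from Boost.Accumulators or any existing library); the point is that the merge is associative, so it works both for sequential pairwise reduction and for a tree-like reduction across threads:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>

// Hypothetical per-thread accumulator state: count, min, max, running mean.
struct Acc {
    std::size_t n = 0;
    double mn = std::numeric_limits<double>::infinity();
    double mx = -std::numeric_limits<double>::infinity();
    double mean = 0.0;

    void add(double x) {
        ++n;
        mn = std::min(mn, x);
        mx = std::max(mx, x);
        mean += (x - mean) / static_cast<double>(n);  // incremental mean update
    }
};

// Reduce two partial states into one. Associative and order-insensitive,
// so partial states from many threads can be combined in any tree shape.
Acc merge(const Acc& a, const Acc& b) {
    Acc r;
    r.n = a.n + b.n;
    r.mn = std::min(a.mn, b.mn);
    r.mx = std::max(a.mx, b.mx);
    // Count-weighted average of the two partial means.
    r.mean = (r.n == 0) ? 0.0
           : (a.mean * static_cast<double>(a.n) + b.mean * static_cast<double>(b.n))
             / static_cast<double>(r.n);
    return r;
}
```

The same merge signature is what the "global framework" bullet would orchestrate; the P^2 and tail states would just need their own (harder) `merge` implementations.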
We do have some nice algorithms, but I'm still unsure about the extent to which we can open-source them. As for your accumulators: are they easy to parallelize? Both usage options seem interesting, but I'll write you more about this topic.