 
User avatar
TraderWalrus
Topic Author
Posts: 28
Joined: March 24th, 2017, 6:54 am

Splitting data according to distribution moments

October 23rd, 2017, 4:24 pm

I am interested in taking price (or rather, returns) data and dividing it into n periods, according to distribution moments, in a way that creates the best separation. How do you approach that?
 
User avatar
outrun
Posts: 4573
Joined: January 1st, 1970, 12:00 am

Re: Splitting data according to distribution moments

October 23rd, 2017, 8:28 pm

I would phrase it as finding the N-1 split points that split the data into N periods/buckets. 

This could be an algorithm:

1) First split evenly.

Then: 
2) For each point to the left/right of a boundary, compute the likelihood that it belongs to the left bucket and to the right bucket, and move it to the bucket that fits best. You can compute the likelihood of a return belonging to a set of returns with kernel density methods: https://en.wikipedia.org/wiki/Kernel_density_estimation
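
A minimal sketch of that loop in Python, assuming scipy for the kernel density estimation (the function name and convergence rule here are illustrative, not a tested recipe):

Code:

import numpy as np
from scipy.stats import gaussian_kde

def refine_splits(returns, n_buckets, max_iter=100):
    # 1) start from an even split; bounds are indices into the return series
    n = len(returns)
    bounds = list(np.linspace(0, n, n_buckets + 1, dtype=int))
    for _ in range(max_iter):
        moved = False
        for i in range(1, n_buckets):
            left = returns[bounds[i - 1]:bounds[i]]
            right = returns[bounds[i]:bounds[i + 1]]
            if len(left) < 4 or len(right) < 4:
                continue  # KDE needs a handful of points per bucket
            kde_left, kde_right = gaussian_kde(left), gaussian_kde(right)
            # 2) the point just right of the boundary moves left if the left
            # bucket's density fits it better, and symmetrically for the
            # point just left of the boundary
            if kde_left(returns[bounds[i]])[0] > kde_right(returns[bounds[i]])[0]:
                bounds[i] += 1
                moved = True
            elif kde_right(returns[bounds[i] - 1])[0] > kde_left(returns[bounds[i] - 1])[0]:
                bounds[i] -= 1
                moved = True
        if not moved:
            break  # no boundary moved this pass: done
    return bounds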
 
User avatar
TraderWalrus
Topic Author
Posts: 28
Joined: March 24th, 2017, 6:54 am

Re: Splitting data according to distribution moments

October 24th, 2017, 11:29 am

I am thinking of k-means clustering.
 
User avatar
outrun
Posts: 4573
Joined: January 1st, 1970, 12:00 am

Re: Splitting data according to distribution moments

October 24th, 2017, 11:38 am

That's different from what you said earlier; it would simply categorize individual returns instead of the periods they're in.
How about a hidden Markov model, e.g. regime-switching models?
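
As a rough sketch of the regime-switching idea, using the hmmlearn package (the package choice, two-regime setup, and parameters are illustrative assumptions):

Code:

import numpy as np
from hmmlearn.hmm import GaussianHMM

returns = np.random.standard_t(5, size=2000) * 0.01  # placeholder data
X = returns.reshape(-1, 1)                           # hmmlearn wants a column

# two hidden regimes, each emitting Gaussian returns
model = GaussianHMM(n_components=2, covariance_type="full", n_iter=200)
model.fit(X)

regimes = model.predict(X)     # most likely regime for each observation
print(model.means_.ravel())    # per-regime mean return
print(np.sqrt([c[0, 0] for c in model.covars_]))  # per-regime volatility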
 
User avatar
TraderWalrus
Topic Author
Posts: 28
Joined: March 24th, 2017, 6:54 am

Re: Splitting data according to distribution moments

October 24th, 2017, 3:26 pm

When I use the position, or index, as a feature, I do get something with k-means. However, this leads the algorithm to keep the centroids as far away from each other as possible, dictating similar group sizes. Or perhaps the variation isn't very large in my data.
[attachment: Figure_1.png]
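
For concreteness, this kind of setup can be sketched with scikit-learn (the feature choice and scaling here are assumptions; the standardized index feature is what makes the clusters contiguous in time, and also roughly equal-sized):

Code:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

returns = np.random.standard_t(5, size=2000) * 0.01  # placeholder data
idx = np.arange(len(returns))

# features: time index plus absolute return, both standardized; the index
# feature keeps clusters contiguous in time, but its centroid-distance
# penalty also pushes the groups toward similar sizes
X = StandardScaler().fit_transform(np.column_stack([idx, np.abs(returns)]))
labels = KMeans(n_clusters=4, n_init=10).fit_predict(X)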
I am not familiar with the method you mentioned; could you provide more pointers?
 
User avatar
Alan
Posts: 3050
Joined: December 19th, 2001, 4:01 am
Location: California
Contact:

Re: Splitting data according to distribution moments

October 24th, 2017, 3:36 pm

Interesting. I would try your method on absolute returns and see how the splits align with other possibilities, like NBER business cycle dates for US data. You could also do a GARCH-type fit and compare with that.

As a practical matter, subperiod means will be very hard to estimate, subperiod variances much easier, and subperiod higher moments very unstable. So splitting by variances (proxied by absolute returns) may make the most sense.
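
For the GARCH comparison, a minimal sketch assuming the arch package (a plain GARCH(1,1); returns are conventionally scaled to percent so the optimizer behaves):

Code:

import numpy as np
from arch import arch_model

returns = np.random.standard_t(5, size=2000)  # placeholder data, in percent
res = arch_model(returns, vol="GARCH", p=1, q=1).fit(disp="off")

# compare the bucket splits against the fitted conditional volatility path
cond_vol = res.conditional_volatility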
 
User avatar
fomisha
Posts: 29
Joined: December 30th, 2003, 4:28 pm

Re: Splitting data according to distribution moments

October 24th, 2017, 9:36 pm

I am interested in taking price (or rather, returns) data and dividing it into n periods, according to distribution moments, in a way that creates the best separation. How do you approach that?
What is the metric by which different classifications will be compared? What are you trying to accomplish?
 
User avatar
TraderWalrus
Topic Author
Posts: 28
Joined: March 24th, 2017, 6:54 am

Re: Splitting data according to distribution moments

October 25th, 2017, 4:02 am

I am interested in taking price (or rather, returns) data and dividing it into n periods, according to distribution moments, in a way that creates the best separation. How do you approach that?
What is the metric by which different classifications will be compared? What are you trying to accomplish?
I start by comparing volatility, but I'm not really sure what the relevant metric is. The purpose is to map statistical properties of the data to the performance of a trading system. I will probably need to narrow it down further, for example volatility at certain times, or in specific regions (for example only when the price is above the previous day's range). Or it may be autocorrelation of returns, or autocorrelation in specific regions.
 
User avatar
outrun
Posts: 4573
Joined: January 1st, 1970, 12:00 am

Re: Splitting data according to distribution moments

October 25th, 2017, 7:18 am

The purpose is to map statistical properties of the data to the performance of a trading system.
So far you've focused on classifying returns (unsupervised learning), describing the data as it is. Instead, I would look at regression rather than classification (supervised learning): find the relation between your performance and the market.

Also, instead of modelling returns without any regard for returns in the preceding days, you would need to model the dynamical state of the market, because your trading system has memory: your trading rules depend on both your current positions (an accumulation of past trades) and perhaps actions taken in the past, like pending limit orders.

You can do all sorts of things to model the dynamical state of the market (latent-variable models like stochastic volatility), but the performance of your trading system would be the most sensible measure to use to characterise the state of the market, since that's the most relevant indicator and rids you of needless intermediate models!

It would be interesting to make variants of your trading system (e.g. different triggers) and see how each performs across time. If one of them is superior in one period, and a different trading system in another period, then that might be a good indicator of different market conditions. Of course there will always be one trading system that's better than all the others, and that doesn't need to mean anything; you need to worry about fake signals and overconfidence. To get a feeling for what is 'significant', you should run your systems on a generated random walk, e.g. by making fake price scenarios by randomly selecting past returns (your fake price scenarios will have a similar return distribution to the actual returns, but all time dependencies in the data will be erased this way).
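
A minimal sketch of that benchmark, assuming an i.i.d. bootstrap of past returns (a block bootstrap would be the refinement if you wanted to keep some short-range dependence):

Code:

import numpy as np

def fake_price_paths(returns, n_paths=1000, seed=0):
    # resample past returns with replacement: the fake paths keep the
    # marginal return distribution but all time dependence is erased
    rng = np.random.default_rng(seed)
    draws = rng.choice(returns, size=(n_paths, len(returns)), replace=True)
    return np.cumprod(1.0 + draws, axis=1)  # price paths starting at 1

# run each trading system on these paths to get a null distribution of its
# performance, then compare the real-data performance against that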

In this setup you can then investigate whether you can find a meta-model that tells you when to switch between your various trading systems. That meta-model could be a classification model that makes decisions based on the past performance of all your trading systems.
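
As a toy version of such a meta-model, one could simply trade whichever system has the best trailing performance (the window length here is an arbitrary assumption):

Code:

import numpy as np

def switch_by_trailing_pnl(pnl, window=60):
    # pnl: array of shape (n_days, n_systems), daily P&L per system.
    # Each day, pick the system with the best mean P&L over the preceding
    # `window` days (only past data, so no look-ahead).
    n_days, n_systems = pnl.shape
    choice = np.zeros(n_days, dtype=int)
    for t in range(window, n_days):
        choice[t] = np.argmax(pnl[t - window:t].mean(axis=0))
    return pnl[np.arange(n_days), choice], choice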
 
User avatar
TraderWalrus
Topic Author
Posts: 28
Joined: March 24th, 2017, 6:54 am

Re: Splitting data according to distribution moments

October 27th, 2017, 2:59 am

Also, instead of modelling returns without any regard for returns in the preceding days, you would need to model the dynamical state of the market, because your trading system has memory: your trading rules depend on both your current positions (an accumulation of past trades) and perhaps actions taken in the past, like pending limit orders.

You can do all sorts of things to model the dynamical state of the market (latent-variable models like stochastic volatility), but the performance of your trading system would be the most sensible measure to use to characterise the state of the market, since that's the most relevant indicator and rids you of needless intermediate models!
Yes, individual returns by themselves are not enough. I also look at autocorrelation; I'm not sure what other measures to add. Perhaps the Hurst exponent? The performance of my systems seems to depend strongly on the extent to which the market is mean reverting, and that changes from one period to another.
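
One quick-and-dirty Hurst estimate uses the scaling of the k-step dispersion of the cumulative return series, std ~ k^H (R/S analysis is the classical route; this shortcut is illustrative). H near 0.5 looks like a random walk; H below 0.5 suggests mean reversion.

Code:

import numpy as np

def hurst(returns, max_lag=100):
    # std of the k-step change of the cumulative series scales like k^H
    cum = np.cumsum(returns)
    lags = np.arange(2, max_lag)
    tau = [np.std(cum[k:] - cum[:-k]) for k in lags]
    H, _ = np.polyfit(np.log(lags), np.log(tau), 1)
    return H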

Trades that come out of a trading system will be the most direct indicator, but they are a limited sample of the market characteristics I try to exploit. The trading rules include many filters, some probably less important than others; some parameters help achieve reasonable performance over various conditions and smooth the equity curve rather than exploit specific conditions; and some trades will not appear due to considerations unrelated to market behavior (e.g. not opening more than one position in parallel, or a setup occurring at hours when the system cannot be monitored). So if a certain behavior in the data provides an edge, the system will probably exploit just a small part of it, and the number of trades to analyze won't be very big. Reacting to changes in the market may be late and costly.

In any case, to use a system's trades in supervised learning, I still need to know which features of the market to focus on.
 
User avatar
TraderWalrus
Topic Author
Posts: 28
Joined: March 24th, 2017, 6:54 am

Re: Splitting data according to distribution moments

October 27th, 2017, 4:05 am

Interesting. I would try your method on absolute returns and see how the splits align with other possibilities, like NBER business cycle dates for US data. You could also do a GARCH-type fit and compare with that.

As a practical matter, subperiod means will be very hard to estimate, subperiod variances much easier, and subperiod higher moments very unstable. So splitting by variances (proxied by absolute returns) may make the most sense.
Absolute returns produce different partitions.
I don't particularly like the method I used; it resists large variations in period size between the groups.
 
User avatar
outrun
Posts: 4573
Joined: January 1st, 1970, 12:00 am

Re: Splitting data according to distribution moments

October 27th, 2017, 7:44 am

Also, instead of modelling returns without any regard for returns in the preceding days, you would need to model the dynamical state of the market, because your trading system has memory: your trading rules depend on both your current positions (an accumulation of past trades) and perhaps actions taken in the past, like pending limit orders.

You can do all sorts of things to model the dynamical state of the market (latent-variable models like stochastic volatility), but the performance of your trading system would be the most sensible measure to use to characterise the state of the market, since that's the most relevant indicator and rids you of needless intermediate models!
Yes, individual returns by themselves are not enough. I also look at autocorrelation; I'm not sure what other measures to add. Perhaps the Hurst exponent? The performance of my systems seems to depend strongly on the extent to which the market is mean reverting, and that changes from one period to another.

Trades that come out of a trading system will be the most direct indicator, but they are a limited sample of the market characteristics I try to exploit. The trading rules include many filters, some probably less important than others; some parameters help achieve reasonable performance over various conditions and smooth the equity curve rather than exploit specific conditions; and some trades will not appear due to considerations unrelated to market behavior (e.g. not opening more than one position in parallel, or a setup occurring at hours when the system cannot be monitored). So if a certain behavior in the data provides an edge, the system will probably exploit just a small part of it, and the number of trades to analyze won't be very big. Reacting to changes in the market may be late and costly.

In any case, to use a system's trades in supervised learning, I still need to know which features of the market to focus on.
Good points.
It might be a bit of a learning curve, but recurrent neural networks can build good latent representations of time series. Building a good state representation of the market is roughly equivalent to building a very good prediction model: if you don't have a good state representation, you won't be able to forecast very well.
I thought this article on OpenAI's blog was very interesting. It shows that a neural network trained to predict the next character, reading text one character at a time, ends up building a very sophisticated representation of the meanings of words: https://blog.openai.com/unsupervised-sentiment-neuron/
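
In that spirit, a bare-bones sketch in PyTorch (a GRU trained to predict the next return, whose hidden state then serves as the latent market state; the architecture and sizes are arbitrary assumptions):

Code:

import torch
import torch.nn as nn

class MarketStateRNN(nn.Module):
    def __init__(self, hidden_size=16):
        super().__init__()
        self.gru = nn.GRU(input_size=1, hidden_size=hidden_size,
                          batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x):                 # x: (batch, time, 1) returns
        states, _ = self.gru(x)           # latent state at every time step
        return self.head(states), states  # next-return forecast + states

model = MarketStateRNN()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(8, 250, 1) * 0.01         # placeholder return sequences

for _ in range(100):
    pred, states = model(x[:, :-1])       # forecast the return one step ahead
    loss = nn.functional.mse_loss(pred, x[:, 1:])
    opt.zero_grad(); loss.backward(); opt.step()
# after training, `states` is the latent market representation per step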