@outrun,

If I understand it, you are predicting the one day ahead price return distribution, divided into 10 bins. But I don't understand the performance measurement of that prediction, since that distribution is not an observable AFAIK. Can you elaborate on what this cross entropy thing is and how it measures the prediction error? (I see the wikipedia discussion of it, but I think I would benefit from hearing an explanation in the context here).

Yes, predicting the one day ahead price return distribution, but divided into **100** bins.

Below is a lengthy post trying to explain the details, like "what does a 4.50 cross entropy mean?". I'm abusing this thread to try to put things on paper. Getting relevant questions like yours and trying to answer them helps me improve clarity!

**Choice of buckets and the reason:**
- I've split the data into two subsets: one for training and one for backtesting. The risk with neural networks is that they memorise the data instead of generalising (finding general patterns). Having separate train and test sets helps you monitor that.

The choice of bins is based solely on the training data, and once computed the bins stay fixed; I use the same bins for backtesting.
- The bins split the empirical return distribution (in the training data) into **equally probable** bins. This ensures that if the model says "I think tomorrow's return will be in bin no. 45", it gives you maximum information. If bin 45 contained 99% of the returns, then that statement wouldn't be very informative.
- The empirical distribution is a very UNlikely distribution when thinking about the distribution of future returns. Historical returns are just "a finite set of returns", and if you translate that directly to a density function you'll get a density that is a set of Dirac delta spikes: the probability of observing a future return that has not been seen in the past would be zero under this model. A good remedy is to pick a very flexible non-parametric density model and optimise its predictive power using cross validation. The most popular method here is the kernel density estimate (KDE). This method replaces each sample with a little Gaussian whose width has to be set (called the kernel bandwidth); the width acts as a smoothing factor. There are heuristics to pick an optimal width, but I like the cross-validation method best: pick 80% of the data as kernel centres, and compute the optimal width by maximising the probability that the remaining 20% of the samples were drawn from the resulting density. Below you see the S&P return distribution and the kernel density with various bandwidth parameters.
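The bandwidth cross-validation step above can be sketched in a few lines of NumPy. This is a minimal illustration, not the author's code: the toy "returns" are simulated Student-t draws, and the 80/20 split and bandwidth grid are arbitrary choices for the demo.

```python
import numpy as np

def gaussian_kde_logpdf(x, centers, bandwidth):
    """Log density of a Gaussian KDE at points x, built from the given kernel centers."""
    d2 = (x[:, None] - centers[None, :]) ** 2        # squared distance to each center
    log_k = -0.5 * d2 / bandwidth**2 - np.log(bandwidth * np.sqrt(2.0 * np.pi))
    # average the kernels in log space (log-sum-exp for numerical stability)
    m = log_k.max(axis=1, keepdims=True)
    return m[:, 0] + np.log(np.exp(log_k - m).sum(axis=1)) - np.log(len(centers))

rng = np.random.default_rng(0)
returns = rng.standard_t(df=4, size=1000) * 0.01     # toy fat-tailed "daily returns"

# 80% of the samples become kernel centres; the bandwidth is scored on the rest
train, held_out = returns[:800], returns[800:]
bandwidths = np.logspace(-4, -1, 30)
scores = [gaussian_kde_logpdf(held_out, train, h).sum() for h in bandwidths]
best_h = bandwidths[int(np.argmax(scores))]
print(f"best bandwidth: {best_h:.5f}")
```

Too small a bandwidth leaves the held-out samples in the "gaps" between spikes (very low likelihood), too large a bandwidth over-smooths; the cross-validated optimum sits in between.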

This density is used to map returns into equally probable bins. At the left in the plot below you see the training-data mass in each bin, at the right the bins. In the plot I've used 30 bins, but the idea is the same. The mass isn't exactly uniform, but that's the result of an added smoothness constraint. The bins are good enough though: each bin has a decent amount of mass and contributes to modelling the shape of the distribution. The main purpose of the bins is to have a simple representation of a density, which will be the output of the model. Also, we know that the price return distribution isn't constant, so we expect bin probabilities to fluctuate anyway, e.g. in high-volatility phases the probability mass will move to the tails.
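As a sketch of the binning idea: equally probable bin edges are just the quantiles of the training density. The author derives them from the smoothed KDE; the snippet below uses plain empirical quantiles of simulated data as a simpler stand-in.

```python
import numpy as np

rng = np.random.default_rng(1)
returns = rng.standard_t(df=4, size=5000) * 0.01   # toy training returns

# 100 equally probable bins: edges at the 1%, 2%, ..., 99% quantiles
n_bins = 100
edges = np.quantile(returns, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])

# map each return to its bin index 0..99
bins = np.searchsorted(edges, returns)

# mass per bin should be close to 1/100
mass = np.bincount(bins, minlength=n_bins) / len(returns)
print(mass.min(), mass.max())   # both near 0.01
```

By construction each bin carries roughly 1% of the training mass, which is exactly the "maximum information per bin" property described above.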

**Entropy:**
If we had just 2 equally probable buckets, then knowing which bucket a sample is going to be drawn from gives you 1 bit of information. With 8 buckets it would be 3 bits, and in general with N equally probable buckets you gain log2(N) bits of information by knowing which bucket something falls in. However, instead of talking about bits and using log2(), the convention is to use the natural logarithm ln(), which measures information in "nats"; the average information is called the entropy. Now if we go back to my plots, you see that the "historical sampling" model scores around 4.6. That's because we have 100 (roughly) equally probable buckets, and not being able to do any better than saying each bucket is just as likely as the others means you're missing ln(100) = 4.605 nats of information. That's the performance of the historical sampling model: on average you miss 4.6 nats of the information that would tell you uniquely which bin tomorrow's return will come from. In my example it's not exactly 4.605 because of the smoothing and the mass not being exactly 1/100 in each bucket.
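The 4.605 baseline is easy to verify numerically. A minimal check of the entropy of a uniform distribution over 100 bins, in both nats and bits:

```python
import numpy as np

# Uniform over 100 equally probable bins: the "historical sampling" baseline
p = np.full(100, 1 / 100)

entropy_nats = -(p * np.log(p)).sum()    # natural log -> nats
entropy_bits = -(p * np.log2(p)).sum()   # log base 2  -> bits

print(entropy_nats)   # ln(100)  ~ 4.605
print(entropy_bits)   # log2(100) ~ 6.644
```

Any model that beats 4.605 nats on the test set is extracting real information about tomorrow's return beyond the unconditional historical distribution.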

**Cross entropy:**
The neural network can be seen as a function with parameters [$]\Theta[$], inputs [$]I(t)[$] (e.g. the past returns), and whose output is a set of 100 class probability estimates [$]\hat{y} = \{p_1,p_2,\cdots,p_{100}\}[$], where [$]p_i[$] is the estimated probability that a return falls in bin [$]i[$]. To train the network we need to modify [$]\Theta[$] in such a way that it predicts good probabilities. "Good" depends on what you want, but a logical choice is to pick the parameters [$]\Theta[$] that maximise the likelihood that the observed returns came from the predicted class densities.

The likelihood function is very simple: if a sequence of returns comes from bins #45, #13, #5, ... then the likelihood is simply [$]p_{45} p_{13} p_{5} \cdots[$]. The **log** likelihood is [$]\log(p_{45}) + \log(p_{13}) + \log(p_{5}) + \cdots[$]

Using a different notation (one-hot encoding) we can write down "the return came from bin 45" as a vector y = [0 0 0 .. 0 1 0 0 .. 0], i.e. all zeros except for the 45th element, which is set to 1. This notation also happens to be the *true (categorical) probability distribution* of the return (this is probably the crucial insight that links cross entropy to maximum likelihood)! The log likelihood of observing [$]y[$] given the estimated distribution [$]\hat{y}[$] then becomes

[$]\sum_{i=1}^{100} y_i \log\hat{y}_i[$]

...which is the cross entropy function! (apart from a sign)

So for this type of problem, maximising the (log) likelihood is equivalent to minimising the cross entropy. There is also a close link to the Kullback–Leibler divergence, which is a measure of mismatch between two distributions (the two are *the same*, up to a constant), and there are tons of other reasons, specifically related to gradient descent, for using log likelihood / cross entropy instead of other loss measures.
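The equivalence is easy to see numerically. In this sketch the "model output" is just a softmax over random logits (the real probabilities would come from the trained network), and the observed return is placed in bin index 45 for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-in for the network output: 100 class probabilities (softmax of random logits)
logits = rng.normal(size=100)
y_hat = np.exp(logits) / np.exp(logits).sum()

# Observed return fell in bin 45 (0-based index) -> one-hot true distribution y
y = np.zeros(100)
y[45] = 1.0

# Cross entropy H(y, y_hat) = -sum_i y_i log(y_hat_i) ...
cross_entropy = -(y * np.log(y_hat)).sum()
# ... which, for a one-hot y, picks out exactly the negative log likelihood
neg_log_lik = -np.log(y_hat[45])

print(cross_entropy, neg_log_lik)   # identical
```

For a one-hot target the entropy of [$]y[$] itself is zero, so here the KL divergence and the cross entropy coincide as well: minimising one minimises the other.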

Here is a nice link with runnable demos to see the effect of various choices on the gradient descent used to find the optimal [$]\Theta[$]