SERVING THE QUANTITATIVE FINANCE COMMUNITY

mtsm
Posts: 352
Joined: July 28th, 2010, 1:40 pm

### Re: Is there a name for this transform method?

Hi, no I did not know about NICE before your post, but I am familiar with variational Bayes and GANs. It's all over the place at the moment, albeit mostly in computer vision.

GANs are known to have training issues, and there is a very large number of variations, with new ones popping up every day.

I have come up with the idea of using this technology to realize asset pricing scenarios myself, but I am not sure there would be enough bang for the buck, at least from my present perspective. I like your point of view on maybe modelling the dynamics in the latent space. That did not occur to me.

The main issue with all this technology is coming up with a truly relevant problem in finance and then modifying the methodologies to fit your problem. All these algorithms are rarely plug and play. For one thing, in finance the signal-to-noise ratio is ridiculously low compared to a lot of computer vision, speech or NLP work. Moreover, when forecasting is at stake, there is often little or nothing to forecast.

outrun
Topic Author
Posts: 4573
Joined: April 29th, 2016, 1:40 pm

### Re: Is there a name for this transform method?

Yes indeed, it's rarely plug and play, and new variants of methods pop up every week! On the other hand it's a very open community, with lots of datasets and code on github... but not in finance! I see the occasional blog post about "forecasting stock prices with neural networks", but those are always flawed, done by kids who have no clue about statistical methods like cross-validation, significance etc.

WRT forecasting, I *do* get very good results with forecasting spot volatility; performance is better (out of sample) than GARCH. And I bet there are many more opportunities in forecasting things you can trade if you look beyond single-stock returns. Pairs trading and other spreads are very good candidates imo.

Building latent space representations is the key strength of neural networks and an essential element in forecasting. Going back to vol: suppose you wanted to forecast tomorrow's volatility but all you did was look at today's return, i.e. model $P(dS_{t+1}|dS_t)$. Now suppose we were in the midst of a high-vol period, huge intraday swings etc., but as it turns out today's close was the same as yesterday's. Then the input to the vol forecast model would be $dS_t=0$, and all it could say is "given past observations where $dS_t=0$, we are very likely in a low-vol period". To improve the model you could add *more* info: also yesterday's return, and the day before. That extra information would allow it to better judge the "state" of the market. Ideally you would want to add info about lots of days and somehow aggregate it into an estimate of the current state of the market. GARCH (and hidden Markov models) solves this by introducing an unobservable latent volatility state and modelling the dynamics of that. However, the GARCH form is pulled out of thin air-ish, based on experience and mathematical convenience. Neural networks like LSTMs and other RNNs build their own latent state representation and model the dynamics of *that*. As it turns out, those squeeze much more predictability out of the data.
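For concreteness, here's a toy numpy sketch (all parameter values made up) of how the GARCH(1,1) recursion folds the whole return history into a single latent variance state — the mechanism an RNN replaces with a learned state:

```python
import numpy as np

def garch11_vol_path(returns, omega=1e-6, alpha=0.09, beta=0.90):
    """GARCH(1,1): sigma2[t+1] = omega + alpha*r[t]^2 + beta*sigma2[t].
    The recursion aggregates the entire return history into one latent
    variance state, so the forecast is not fooled by a single quiet
    day (dS_t ~ 0) inside a turbulent stretch."""
    sigma2 = np.empty(len(returns) + 1)
    sigma2[0] = np.var(returns)          # initialise with sample variance
    for t, r in enumerate(returns):
        sigma2[t + 1] = omega + alpha * r**2 + beta * sigma2[t]
    return np.sqrt(sigma2)

rng = np.random.default_rng(0)
# synthetic regime-switching returns: a calm stretch, then a turbulent one
r = np.concatenate([rng.normal(0, 0.01, 500), rng.normal(0, 0.03, 500)])
vol = garch11_vol_path(r)
# even when the most recent return happens to be ~0, the latent state
# keeps the forecast elevated inside the turbulent regime
print(vol[500], vol[-1])
```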

E.g. below is a plot of training a simple vol forecasting model. There are two reference models in the plot: the top blue line is unconditional historical sampling (a very simple model with a good distribution, it has tails etc., but it thinks all days are alike). The bottom blue line is GARCH(1,1), and the black line is the out-of-sample performance of the NN (the thin red line is in-sample performance).

outrun
Topic Author
Posts: 4573
Joined: April 29th, 2016, 1:40 pm

### Re: Is there a name for this transform method?

What's strange is that the lumps and holes are also in the first example, which you presumably copied from the paper or someone else's implementation, so it's not a bug in your code. Moreover, the "double U" example has a second very strange artifact: the forward map creates anomalous density below the origin, and the reverse map makes the upper U more dense than the lower U.
.. so I let it run all night to see if it was perhaps caused by lack of convergence. The good thing is that it didn't collapse into an overfitted state! It also looks like it has improved (and the log likelihood of the data increased from 13.000 to 13.400). There are still lumps and holes, but it's a bit difficult to assess. The random samples bottom right also have little clusters and holes, although fewer IMO. From a theoretical point of view this should not be a limitation of the method: you can plug in any neural network, with as many layers as you wish. Since this is a 2D example, and since the method transforms each dimension individually, I decided (not just now, but some time ago) to add a trick to better mix the dimensions and allow the NN to better capture joint factors. I did this by adding rotation layers to the NN that simply rotate the x,y plane 45 degrees every now and then. The nice thing is that rotations are of course easily invertible (they don't stretch or squeeze, and are very cheap computationally), and at the same time they create a new representation where each coordinate depends on all the coordinates before the rotation. I think it helps in these low dimensions, but I still have to measure its usefulness.
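The rotation trick can be sketched like this (numpy, 2D, fixed 45-degree angle; the function names are mine). Because a rotation matrix is orthogonal, the inverse is just the transpose and the log-det-Jacobian contribution to the likelihood is exactly zero:

```python
import numpy as np

def rotation_layer(theta=np.pi / 4):
    """A fixed 2-D rotation used as a 'mixing' layer between the
    per-dimension transforms. Trivially invertible (transpose) and
    volume preserving: |det R| = 1, so densities pass through unchanged."""
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s], [s, c]])
    forward = lambda x: x @ R.T          # rotate each sample
    inverse = lambda z: z @ R            # R^{-1} = R^T for a rotation
    return forward, inverse

fwd, inv = rotation_layer()
x = np.random.default_rng(1).normal(size=(100, 2))
z = fwd(x)
# exact round trip; norms (and hence volumes) are preserved
print(np.abs(inv(z) - x).max())
```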

However, I see I need to monitor more statistics; I have a feeling the variance of the latent Gaussians is not 1. Perhaps I forgot a constant somewhere? It looks more like sqrt(0.5) to me.

Posts: 23951
Joined: September 20th, 2002, 8:30 pm

### Re: Is there a name for this transform method?

I'd be curious to see what happens if the original data is a 2-D Gaussian. Then you could look at the distribution of point density in each small region relative to the analytic value of the PDF for that small region. (Or you could start with a uniform distribution and attempt to map it to a latent uniform distribution).

Obviously, the original data set will show some scatter in the empirical vs. theoretical density, because random chance creates regions of higher and lower density (Poisson-distributed, right?). But I'd bet that the constructed latent Gaussian will have worse scatter.

One of the big challenges in modeling a statistical distribution from empirical data is that the specific pattern of empirical points in any given micro-region may be indicative of either random chance or true variations in the density function. It's easy to over-fit (or under-fit) these empirical density variations. Just as a guess, I think this method might have an over-fitting bias.
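A quick numpy sketch of the proposed check for a standard 2-D Gaussian (grid size and sample count made up): count points in small cells, divide by the analytic expectation, and look at the scatter of the ratio.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
pts = rng.standard_normal((n, 2))        # the "original data": true 2-D Gaussian

# small square cells over [-2, 2]^2
edges = np.linspace(-2, 2, 21)           # 20 x 20 grid, cell side 0.2
counts, _, _ = np.histogram2d(pts[:, 0], pts[:, 1], bins=[edges, edges])

# analytic expected count per cell: n * pdf(cell center) * cell area
centers = (edges[:-1] + edges[1:]) / 2
cx, cy = np.meshgrid(centers, centers, indexing="ij")
pdf = np.exp(-(cx**2 + cy**2) / 2) / (2 * np.pi)
expected = n * pdf * 0.2**2

# observed/expected should hover around 1, with Poisson counting noise;
# running the latent samples through the same check would reveal any
# extra (over-fitted) scatter
ratio = counts / expected
print(ratio.mean(), ratio.std())
```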

Alan
Posts: 10270
Joined: December 19th, 2001, 4:01 am
Location: California
Contact:

### Re: Is there a name for this transform method?

I'd be curious to see if the neural network result (as shown) can be beaten by a linear combination of the GARCH(1,1) forecast plus the closing VIX (with the optimal linear coefficients selected by minimizing the forecast error in the training period). Then, what would happen if you also allowed the neural network to use the same VIX data in addition to the stock price history?

outrun
Topic Author
Posts: 4573
Joined: April 29th, 2016, 1:40 pm

### Re: Is there a name for this transform method?

Hmm. I don't think it's overfitting to small regions; the transform is smooth and it won't have enough memory/degrees of freedom to memorise the training samples. The grid paints the same picture of smooth deformations.
I'm training it on random subsets instead of the full training set, which has a similar effect to bagging. Leaving out samples while training forces generalisation, and is something you always need to do with these methods and limited data.

I've done two experiments with test data: a correlated Gaussian, and a uniform.
In the right column you see statistics of the marginal distribution (first dimension) of the two Gaussians (mid-top and mid-mid). The top right plot also has the correlation.

In general, the mean and variance converge very quickly to 0 and 1 within sample noise bounds. Skew and excess kurtosis converge much more slowly for the stock data. Still, I don't want to add extra penalty terms to the loss function to force faster convergence of the higher moments; pure likelihood seems like the most useful base case.
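For reference, here's roughly what those finite-sample noise bounds look like for a true N(0,1) sample of comparable size (numpy only; the sample size is a stand-in):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4000                               # stand-in for the dataset size
z = rng.standard_normal(n)             # stand-in for latent samples

m = z.mean()
s = z.std(ddof=1)
skew = ((z - m) ** 3).mean() / s**3
ex_kurt = ((z - m) ** 4).mean() / s**4 - 3.0   # excess kurtosis

# rough 2-sigma sampling bands for a true N(0,1) of this size:
# higher moments have much wider bands, hence slower apparent convergence
bands = {
    "mean":    2 / np.sqrt(n),         # ~0.032
    "std-1":   2 / np.sqrt(2 * n),     # ~0.022
    "skew":    2 * np.sqrt(6 / n),     # ~0.077
    "ex_kurt": 2 * np.sqrt(24 / n),    # ~0.155
}
print(m, s, skew, ex_kurt)
print(bands)
```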

And here is the uniform -> Gaussian. This one is really interesting; it's not easy, I expect, to map a finite-support density to one with infinite support. Maybe it's not sensible either: in a production environment I would not do this, I would preprocess with a transform to eliminate extrapolation risk. Still, the grid mapping in the top-mid latent plot shows very clearly what it is doing!

I'm going to let it run because it hasn't converged yet.

outrun
Topic Author
Posts: 4573
Joined: April 29th, 2016, 1:40 pm

### Re: Is there a name for this transform method?

> I'd be curious to see if the neural network result (as shown) can be beaten by a linear combination of the GARCH(1,1) forecast plus the closing VIX (with the optimal linear coefficients selected by minimizing the forecast error in the training period). Then, what would happen if you also allowed the neural network to use the same VIX data in addition to the stock price history?
That's *exactly* the plan!
The spot vol model predicts the "full" next-day distribution (not just the expected mean vol), but it's modelled as bins. The plan is to add the VIX curve, both as input and as output, so that we can use the model to MC-simulate the vol market as a whole forward in time (to do so we need forecasts of future model inputs), i.e. generate multistep forecasts via sampling.

The problem is that binning doesn't scale with dimensions. Having 4000 datapoints and 10 bins is fine, but if we move to 8d we end up with 10^8 bins, far more than the number of datapoints we have! So to fix that I needed a better representation of densities, but at the same time the representation needs to be powerful enough to capture any weird shapes or dependencies if they were in the data. E.g. I expect the VIX curve to have weird non-linear mean-reverting behaviour.
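The arithmetic, spelled out (sample count from above, 10 bins per dimension assumed):

```python
# equally probable bins per dimension vs. dimensionality:
# the bin count explodes while the data stays fixed
n_samples = 4000
bins_per_dim = 10
for dims in (1, 2, 4, 8):
    n_bins = bins_per_dim ** dims
    print(f"{dims}d: {n_bins:>12,} bins, "
          f"{n_samples / n_bins:.6f} samples per bin on average")
```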

So this idea is indeed next on the list, but it might take a week or so; I need to connect the spot vol model that can learn the latent state evolution dynamics with this model that can capture densities and transform them into easy-to-sample-and-learn latent representations.

outrun
Topic Author
Posts: 4573
Joined: April 29th, 2016, 1:40 pm

### Re: Is there a name for this transform method?

Here's the U->N mapping after convergence. Considering it can't possibly know that the source was U (or whether the little density fluctuations around the edges are real or not), it's doing a very good mapping.

The sample statistics of the first 4 moments (mean, std, skew and excess kurtosis) are hovering around the theoretical values, and are indistinguishable from finite-sample statistics computed from samples drawn from a true 2d Gaussian.

I'm now going to focus on the next steps: non-linear dimension reduction with "autoencoders". The plan is to learn compact representations of curves and surfaces with many points on them (and benchmark against e.g. PCA), and then model the dynamics of that compact representation.

Posts: 23951
Joined: September 20th, 2002, 8:30 pm

### Re: Is there a name for this transform method?

Cool results! (Did the system recreate the Box–Muller transform?)

The univariate densities still look a little skewed on the final N -> U map, but I'm sure some of that can't be helped where tails are concerned. BTW, did the inverse map create any samples outside the U^2 interval?

outrun
Topic Author
Posts: 4573
Joined: April 29th, 2016, 1:40 pm

### Re: Is there a name for this transform method?

It has been a while!
I've been experimenting with this method in higher dimensions, and I've also tried to adapt it to model conditional densities instead of regular densities.

I trained the model on images of handwritten digits. The images are 28x28 pixels, with pixel intensities between 0 and 1. Each image can be seen as a sample from a 784-dimensional hypercube. Here are some example images I fed to the network; they are from the well-known MNIST dataset.

And below you see some generated random samples. The samples are conditioned on the digit seen in the image.
• The top row has the same digit conditions (7,3,4,6,..) as the top row in the sample input example above.
• The middle row are random samples conditioned on the digit label being 0,1,2,3...9. This is to test whether it's conditioning or ignoring the condition.
• The bottom row are all random samples conditioned on the number being a 5. This is to test if there is variability in the posterior density. A possible failure mode would be showing the exact same image every time we ask for a "5"; I've run into this issue with GAN methods, it's called mode collapse.

Not bad, no? It clearly captures density structure and dependencies in this high-dimensional space, ..and it's also conditioning..

outrun
Topic Author
Posts: 4573
Joined: April 29th, 2016, 1:40 pm

### Re: Is there a name for this transform method?

I did some tests to predict tomorrow's return distribution given past returns, including/excluding VIX data:

The performance is the black line in the right plot. The left plot is the performance on data it looked at during training, the right plot is the performance on unseen data.

* 1st: just using price returns, it has a cross-entropy (lower is better) of 4.479.
* 2nd: only using traded volume and nothing else, it has a cross-entropy of 4.558. "Terrible model" is what Trump would have to say about it.
* 3rd: uses price, traded volume and VIX info, 4.467. "Great model"
* 4th: using just VIX info, no price info, 4.480.

So, traded volume contains *some* vol info, as expected, but not as much as price or VIX data. Both price and VIX info contain enough information to make a good conditional vol model, and looking at either close-to-close returns or adding OHLC data didn't make any difference. Combining them both improved the performance just a little bit.

I tried MANY more things (all sorts of neural network topologies), and my view is that on a daily return time scale, and using this limited amount of data (just 10 years of daily observations), the network quickly learns the limited amount of information and structure that's contained in the data. What remains is noise that can't be predicted by adding more indicators.

I'm going to move to high-freq data, maybe minute bars, which gives me 50x more data. Working with this little data was a real challenge! It also made me think that this data issue needs to be solved if you want to stay at daily resolution. I also expect that it will outperform GARCH by an even bigger margin if we make multi-day forecasts.

Posts: 23951
Joined: September 20th, 2002, 8:30 pm

### Re: Is there a name for this transform method?

Nice!

With 50X more data and some windowing of the data, I wonder if you'll find evidence of significant regime changes in which newer data behaves differently than old.

Alan
Posts: 10270
Joined: December 19th, 2001, 4:01 am
Location: California
Contact:

### Re: Is there a name for this transform method?

@outrun,
If I understand it, you are predicting the one-day-ahead price return distribution, divided into 10 bins. But I don't understand the performance measurement of that prediction, since that distribution is not an observable AFAIK. Can you elaborate on what this cross-entropy thing is and how it measures the prediction error? (I see the Wikipedia discussion of it, but I think I would benefit from hearing an explanation in the context here.)

outrun
Topic Author
Posts: 4573
Joined: April 29th, 2016, 1:40 pm

### Re: Is there a name for this transform method?

Yes, predicting the one-day-ahead price return distribution, but divided into 100 bins.

Below is a lengthy post trying to explain details like "what does a cross entropy of 4.50 mean?". I'm abusing this thread to try to put things on paper. Getting relevant questions like yours and trying to answer them helps me improve clarity!

Choice of buckets and the reason:
• I've split the data into two subsets: one for training and one for backtesting. The risk with neural networks is that they memorise the data instead of generalising (finding general patterns). Having separate train and test sets helps you monitor that.
The choice of bins is based solely on the training data, and once computed the bins stay fixed; I use the same bins for backtesting.
• The bins split the empirical return distribution (in the training data) into equally probable bins. This ensures that if the model says "I think tomorrow's return will be in bin nr. 45", it gives you maximum information. If I had picked bin 45 to hold 99% of the returns, that wouldn't be very informative.
• The empirical distribution is a very UNlikely distribution when thinking about the distribution of future returns. Historical returns are just "a finite set of returns", and if you translate that directly into a density function you'll get a density that's a set of Dirac delta spikes: the probability of observing a specific future return that has not been seen in the past would be zero under this model. A good remedy is to pick a very flexible non-parametric density model and optimise its predictive power using cross-validation. The most popular method here is the kernel density estimate with a 'width' parameter. This method replaces each sample with a little Gaussian whose width has to be set (the kernel bandwidth); this width acts as a smoothing factor. There are heuristics to pick an optimal width, but I like the cross-validation method best: you pick 80% of the data as kernel centers, and compute the optimal width by maximizing the probability that the remaining 20% of samples were drawn from it. Below you see the S&P return distribution and the kernel density with various bandwidth parameters.
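The 80/20 bandwidth selection described above can be sketched directly in numpy (synthetic data standing in for returns; candidate widths made up):

```python
import numpy as np

def heldout_loglik(train, test, width):
    """Mean log-likelihood of `test` under a Gaussian KDE built on
    `train` with the given kernel bandwidth."""
    d = test[:, None] - train[None, :]                    # pairwise diffs
    k = np.exp(-0.5 * (d / width) ** 2) / (width * np.sqrt(2 * np.pi))
    return np.log(k.mean(axis=1)).mean()

rng = np.random.default_rng(4)
x = rng.standard_normal(1000)          # stand-in for daily returns
train, test = x[:800], x[800:]         # 80% kernel centers, 20% held out

# too-narrow widths overfit the spikes, too-wide ones oversmooth;
# the held-out likelihood peaks somewhere in between
widths = np.array([0.02, 0.05, 0.1, 0.2, 0.5, 1.0])
scores = [heldout_loglik(train, test, w) for w in widths]
best = widths[int(np.argmax(scores))]
print("best bandwidth:", best)
```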

This density is used to map returns into equally probable bins. At the left in the plot below you see the training-data mass in each bin, at the right the bins. In the plot I've used 30 bins, but the idea is the same. It's not exactly uniform, but that's the result of adding a smoothness constraint. The bins are good enough though: each bin has a good amount of mass and adds to modelling the shape of the distribution. The main purpose of the bins is to have a simple representation of a density which will be the output of the model. Also, we know that the price return distribution isn't constant, so we expect bin probabilities to fluctuate anyway; e.g. in high-volatility phases the probability mass will move to the tails.
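Building equally probable bins from data is essentially a quantile computation; here's a sketch (synthetic returns, 10 bins instead of 100):

```python
import numpy as np

rng = np.random.default_rng(5)
returns = rng.standard_normal(4000) * 0.01     # stand-in for training returns

n_bins = 10                                    # 100 in the real model
# interior edges at the empirical quantiles -> equally probable bins
edges = np.quantile(returns, np.linspace(0, 1, n_bins + 1)[1:-1])

def to_bin(r):
    """Map a return to its bin index 0..n_bins-1."""
    return int(np.searchsorted(edges, r))

# by construction each bin catches ~1/n_bins of the training mass
counts = np.bincount([to_bin(r) for r in returns], minlength=n_bins)
print(counts)
```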

Entropy:
If we had just 2 equally probable buckets, then knowing which bucket a sample will be drawn from gives you 1 bit of information. With 8 buckets it would be 3 bits, and in general with N equally probable buckets you gain log2(N) bits of information by learning which bucket something is in. However, instead of talking about bits and using log2(), people use the natural log ln(); the unit is then called a nat, and the expected information is the entropy. Now if we go back to my plots, you see that the "historical sampling" line sits around 4.6. That's because we have 100 equally-probable-ish buckets, and not being able to do any better than saying each bucket is just as likely as the others means you're missing ln(100) = 4.605 nats of information. That's the performance of the historical sampling model: you miss on average 4.6 nats of the information that would tell you uniquely which bin tomorrow's return will come from. In my example it's not exactly 4.605 because of the smoothing and the mass not being exactly 1/100 in each bucket.
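A tiny numeric check of that ln(100) figure (the entropy helper is mine):

```python
import numpy as np

def entropy(p):
    """Entropy of a categorical distribution, in nats (natural log)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -(p * np.log(p)).sum()

# 100 equally probable bins: knowing the bin is worth ln(100) nats
uniform100 = np.full(100, 1 / 100)
print(entropy(uniform100))            # ln(100) = 4.6052 nats

# a peaked distribution carries less entropy: a model that can sharpen
# its bin probabilities "misses" fewer nats per observation
peaked = np.array([0.5] + [0.5 / 99] * 99)
print(entropy(peaked))
```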

Cross entropy:
The neural network can be seen as a function with parameters $\Theta$ and inputs $I(t)$ (e.g. the past returns), whose output is a set of 100 class probability estimates $\hat{y} = \{p_1,p_2,\cdots,p_{100}\}$, where $p_i$ is the estimated probability that the return is in bin $i$. To train the network we need to modify $\Theta$ in such a way that it predicts good probabilities. "Good" depends on what you want, but a logical choice is to pick the parameters $\Theta$ that maximise the likelihood that the observed returns came from the predicted class density.

The likelihood function is very simple: if a sequence of returns falls in bins #45, #13, #5, ... then the likelihood is simply $p_{45} p_{13} p_{5} \cdots$, and the log likelihood is $\log(p_{45}) + \log(p_{13}) + \log(p_{5}) + \cdots$.

Using a different notation (one-hot encoding) we can write down "return came from bin 45" as a vector y = [0 0 0 .. 0 1 0 0 .. 0], i.e. all zeros, except with the 45th element set to 1. This notation also happens to be the true (categorical) probability distribution of the return (this is probably the crucial insight that links CE to ML)! The log likelihood of observing $y$ given the estimated distribution $\hat{y}$ then becomes
$\sum_{i=1}^{100} y_i \log\hat{y}_i$

...which is the cross entropy function! (apart from a sign)

So for this type of problem, maximizing the (log) likelihood is equivalent to minimising the cross entropy. There is also a close link to the Kullback-Leibler divergence, which is a measure of mismatch between two distributions (they are *the same*, up to a constant), and there are tons of other reasons, specifically related to gradient descent, for using log likelihood / cross entropy instead of other loss measures. Here is a nice link with runnable demos to see the effect of various choices on the gradient descent used to find the optimal $\Theta$
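A minimal numpy check that the one-hot cross entropy and the negative log likelihood coincide (random probabilities; the bin numbers are the ones from the example above):

```python
import numpy as np

rng = np.random.default_rng(6)

# predicted bin probabilities for 3 days, 100 bins each (rows sum to 1)
p = rng.random((3, 100))
p /= p.sum(axis=1, keepdims=True)

observed = [45, 13, 5]                 # bins the realized returns fell in

# negative log likelihood of the observations
nll = -sum(np.log(p[d, b]) for d, b in enumerate(observed))

# one-hot encoding: y is the "true" categorical distribution per day
y = np.zeros((3, 100))
y[np.arange(3), observed] = 1.0
cross_entropy = -(y * np.log(p)).sum()

print(nll, cross_entropy)              # identical by construction
```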

Posts: 23951
Joined: September 20th, 2002, 8:30 pm

### Re: Is there a name for this transform method?

Interesting.

Mapping each day's realized data point through that day's predicted cumulative distribution results in a random variable that should be uniformly distributed across all the days.

Even if one only has one sample point drawn from each daily distribution, one can examine the uniformity of CDF(x) to test or measure the quality and properties of the underlying distribution forecasting process. The plot of historical probability per bucket should be flat within the statistical limits of a Poisson arrival process, but it's not:

The central lump in historical probability per bucket suggests that the forecast distributions under-predict small movements and over-predict large ones. Yet the relative flatness of the graph in the left and right tails suggests the forecast distributions do have the correct tail properties.

The one issue that is not clear from the binned data is whether any days had a realized CDF outcome of 0 or 1. These are potentially very serious events, in which the actual outcome was "impossible" according to the forecast distribution and the error between actual and forecast returns might be unbounded.
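This probability-integral-transform check can be sketched end-to-end with forecasts that are correct by construction, so the bucket counts should be flat up to counting noise (all numbers made up):

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(7)
n_days = 4000

# each day gets its own forecast vol; draw the realized return from the
# SAME distribution, so the forecasts are calibrated by construction
vols = rng.uniform(0.005, 0.03, n_days)
realized = rng.normal(0.0, vols)

def norm_cdf(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

# PIT: push each day's realized return through that day's forecast CDF
u = np.array([norm_cdf(r / v) for r, v in zip(realized, vols)])

# with calibrated forecasts u is Uniform(0,1): flat bucket counts, and
# a chi-square statistic around its degrees of freedom (9 here)
counts = np.histogram(u, bins=10, range=(0.0, 1.0))[0]
expected = n_days / 10
chi2 = ((counts - expected) ** 2 / expected).sum()
print(counts, round(chi2, 2))
```

A miscalibrated forecaster would show up as a central lump or hollow tails in `counts`, and any `u` exactly at 0 or 1 would flag the "impossible outcome" events mentioned above.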