Serving the Quantitative Finance Community

 
katastrofa
Topic Author
Posts: 7440
Joined: August 16th, 2007, 5:36 am
Location: Alpha Centauri

Gap statistic problem

October 14th, 2017, 8:25 pm

On the off chance that you have come across this method yourself, can you explain where the "1" in front of the formula for the standard error of the gap statistic comes from in "Estimating the number of clusters in a data set via the gap statistic" by Tibshirani et al. (top of p. 415)?
[$]s_k = \sqrt{\mathbf{1}+1/B}\ \text{std}(k)[$]
Does it account for the fact that one cannot remove the sampling error of the data-generating distribution (i.e., one cannot resample the generating distribution)? If so, would you agree that it should not be there if my sample is not random but representative (meaning its error is much smaller)?
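For concreteness, here is roughly what I'm computing — a minimal NumPy sketch of the paper's construction (my own toy Lloyd's k-means and the uniform bounding-box reference are my reading of Tibshirani et al., not their code):

```python
import numpy as np

def kmeans_wk(X, k, iters=20, seed=0):
    """Bare-bones Lloyd's k-means; returns log W_k, the log of the pooled
    within-cluster sum of squares about the cluster means."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    wk = sum(((X[labels == j] - centers[j]) ** 2).sum() for j in range(k))
    return np.log(wk)

def gap_statistic(X, k, B=20, seed=0):
    """Gap(k) and s_k: the reference data are B uniform draws on the
    bounding box of X, as in the paper's simplest variant."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    ref = np.array([kmeans_wk(rng.uniform(lo, hi, X.shape), k, seed=b)
                    for b in range(B)])
    gap = ref.mean() - kmeans_wk(X, k)
    s_k = np.sqrt(1 + 1 / B) * ref.std()   # the sqrt(1 + 1/B) factor in question
    return gap, s_k
```

The `sqrt(1 + 1/B)` multiplies the standard deviation of the B simulated log W values, which is what my question is about.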
 
outrun
Posts: 4573
Joined: January 1st, 1970, 12:00 am

Re: Gap statistic problem

October 14th, 2017, 9:33 pm

Don't know.. 
I was working today on something very related: modelling a density with a Gaussian Mixture Model (last paragraph above Section 3 on p. 413). I use cross-validation for the reference distribution and the mean log-likelihood as the measure... but you can only do this if your objective is to model a density.

[Image: mean log-likelihood on train and test sets vs. number of mixture components]

You can see the training set being approximated better and better as you add components. However, on the test set (computing the likelihood of test-set samples under a model fitted on the training set) you see that 10 components is optimal (it doesn't need to be exact, as long as it doesn't overfit and degrade). You also clearly see the gap between train and test growing, indicating that it's overfitting at some point.

My main point is that at some point I stopped using statistics based on distributional assumptions and moved to cross-validation instead, because it's distribution-free, whereas the statistical criteria always make some assumption that introduces bias somewhere. In the past I've used BIC and AIC as penalty terms for superfluous complexity (too many components, too little data), but they're not that good.
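Schematically, what I'm doing looks like this — a bare-bones 1-D EM sketch (my actual model is richer; `fit_gmm_1d` and the quantile initialisation are just illustrative choices):

```python
import numpy as np

def fit_gmm_1d(x, k, iters=200):
    """Tiny EM fit of a k-component 1-D Gaussian mixture."""
    mu = np.quantile(x, (np.arange(k) + 0.5) / k)   # spread initial means over the data
    sig = np.full(k, x.std())
    w = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        dens = w * np.exp(-0.5 * ((x[:, None] - mu) / sig) ** 2) / (sig * np.sqrt(2 * np.pi))
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: reweight, re-centre, re-spread (with a floor to avoid collapse)
        nk = r.sum(axis=0)
        w = nk / x.size
        mu = (r * x[:, None]).sum(axis=0) / nk
        sig = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk) + 1e-9
    return w, mu, sig

def mean_loglik(x, w, mu, sig):
    """Mean log-likelihood of held-out samples under a fitted mixture."""
    dens = (w * np.exp(-0.5 * ((x[:, None] - mu) / sig) ** 2)
            / (sig * np.sqrt(2 * np.pi))).sum(axis=1)
    return np.log(dens).mean()
```

Fit on a training split for each candidate k, score `mean_loglik` on the test split, and pick the k where the held-out score stops improving.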
 
katastrofa
Topic Author
Posts: 7440
Joined: August 16th, 2007, 5:36 am
Location: Alpha Centauri

Re: Gap statistic problem

October 14th, 2017, 11:56 pm

I see. You could use any of the validation methods to choose the model/number of clusters (they sometimes have multiple minima, and a model in which a few smaller minima from different tests coincide can be better than one favoured by a single deeper minimum in one test, if I'm clear).
My choice of the (computationally simpler) method was determined by the application: a JavaScript applet (trend detection in the right panel uses k-means clustering with the above gap statistic; warning: it's my webpage on my server, so I will see your IP). I've also figured out that [$]s_k[$] is not the Monte Carlo variance of the estimator of the gap statistic, but the estimator of the gap statistic's variance. Hence the 1 in the formula is OK (I really wanted it gone, because then I got more clusters and the animation looked nicer).
 
outrun
Posts: 4573
Joined: January 1st, 1970, 12:00 am

Re: Gap statistic problem

October 15th, 2017, 8:22 am

I've been on your website before but never noticed that those images are interactive! That's really cool! Well done.

Maybe a micro-animation could convey that it's interactive? On the other hand, the surprise is also nice.

Good that you've solved it. Once you go beyond skimming, you need to get to the bottom of papers.


Yes, you can pick any number of clusters, and more is always better in-sample. For some applications, what counts is out-of-sample performance (on new data). To force generalization, people limit the number of clusters; there are many theories that help manage that.

In the plot below the blue line is the model, with increasing degrees of freedom from left to right (for clustering this would be an increasing number of clusters). The in-sample fit is best on the right, but why isn't that the best model? The reason is that if you remove or add a data point, the blue line changes a lot, meaning the model isn't stable but is learning sample noise. I really like this subject (model choice), hence all the talk!
[Image: the same data fitted with models of increasing degrees of freedom, left to right]
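The effect is easy to reproduce. A small NumPy sketch (synthetic noisy-sine data, my own stand-in rather than the data behind the plot): in-sample error keeps falling as the degrees of freedom grow, while the held-out error does not:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 40)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, x.size)   # noisy "red dots"
train, test = np.arange(0, 40, 2), np.arange(1, 40, 2)   # interleaved split

def fit_mse(deg):
    """In-sample and held-out MSE of a degree-`deg` polynomial fit."""
    coef = np.polyfit(x[train], y[train], deg)
    return tuple(float(np.mean((np.polyval(coef, x[s]) - y[s]) ** 2))
                 for s in (train, test))

for deg in (1, 3, 7, 15):
    tr, te = fit_mse(deg)
    print(f"deg={deg:2d}  train={tr:.3f}  test={te:.3f}")
```

The train column decreases monotonically with the degrees of freedom; the test column is what tells you when to stop.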
 
Cuchulainn
Posts: 20256
Joined: July 16th, 2004, 7:38 am
Location: 20, 000

Re: Gap statistic problem

October 16th, 2017, 9:39 am

Sounds like a really difficult problem. My take (based on looking at the 3 diagrams) is that the task is to find the "best" function to fit a set of points. I think (correct me if I am wrong) that this is an unsolvable problem in numerical analysis, and the best that can be hoped for is to try some candidates and see what happens?
BTW the plot on the far right looks like some kind of piecewise (B-)spline; the others look like global polynomials. The second looks like a quadratic model.

// In numerical analysis, high-degree polynomials (degree 300!) are bad and will overfit. Changing one data point has global impact. The same holds in computer graphics and CAD, hence the use of piecewise continuous polynomials.
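To illustrate with Runge's classic example, 1/(1+25x^2) (my choice of demonstration, not from the thread): a degree-14 global polynomial through equispaced nodes oscillates wildly near the ends, while even naive piecewise-linear interpolation through the same nodes stays close:

```python
import numpy as np

f = lambda x: 1 / (1 + 25 * x ** 2)       # Runge's function
nodes = np.linspace(-1, 1, 15)            # equispaced interpolation nodes
fine = np.linspace(-1, 1, 401)            # dense grid for measuring error

# degree-14 global polynomial through the 15 nodes
coef = np.polyfit(nodes, f(nodes), 14)
err_poly = np.abs(np.polyval(coef, fine) - f(fine)).max()

# piecewise-linear interpolation through the same nodes
err_piecewise = np.abs(np.interp(fine, nodes, f(nodes)) - f(fine)).max()

print(err_poly, err_piecewise)   # the global polynomial is far worse
```

A piecewise cubic spline would do better still; the point is the locality of piecewise methods versus the global sensitivity of one high-degree polynomial.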
 
outrun
Posts: 4573
Joined: January 1st, 1970, 12:00 am

Re: Gap statistic problem

October 16th, 2017, 10:48 am

Yes the key is to know how to quantify "overfit" and rank models.

Eg if you have a trillion datapoints then a 300 degree polynomial might be good. Also if you propose an alternative model (piecewise continuous) then you should *quantify* why it's better and by how much.

One approach is to view the red dots as samples from a distribution, and overfitting as "fitting to sample noise" instead of to the true (unknown) distribution.
 
Cuchulainn
Posts: 20256
Joined: July 16th, 2004, 7:38 am
Location: 20, 000

Re: Gap statistic problem

October 16th, 2017, 11:05 am

Eg if you have a trillion datapoints then a 300 degree polynomial might be good. Also if you propose an alternative model (piecewise continuous) then you should *quantify* why it's better and by how much.

No problem to quantify. It's well known in numerical analysis, as already mentioned. You should be aware of this body of work. If you are interested in a challenge, post a training data set and I'll throw it into a spline interpolator. We'll know soon enough.

Maybe the data corresponding to this link as input?

if you have a trillion datapoints then a 300 degree polynomial might be good.
True. But that is not the issue here.

Image
 
Cuchulainn
Posts: 20256
Joined: July 16th, 2004, 7:38 am
Location: 20, 000

Re: Gap statistic problem

October 16th, 2017, 11:37 am

To be precise about error estimates:

1. n-degree polynomials: Dahlquist and Björck, Theorem 4.3.3, page 100
(they also explain those wiggles you are getting).

2. Cubic splines: Ahlberg, Nilson and Walsh, Theorem 2.3.3, page 27. We can even get decent approximations to df/dx and d^2f/dx^2.
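For instance (a quick SciPy check on sin(x), my own example rather than the book's): the spline's first and second derivatives track cos(x) and -sin(x) closely away from the ends:

```python
import numpy as np
from scipy.interpolate import CubicSpline

x = np.linspace(0, np.pi, 30)
cs = CubicSpline(x, np.sin(x))                  # spline through 30 samples of sin
xx = np.linspace(0.1, np.pi - 0.1, 200)         # evaluate away from the boundaries

err_d1 = np.abs(cs(xx, 1) - np.cos(xx)).max()   # error in df/dx
err_d2 = np.abs(cs(xx, 2) + np.sin(xx)).max()   # error in d^2f/dx^2
print(err_d1, err_d2)
```

Both errors shrink as the node spacing h decreases, at roughly O(h^3) and O(h^2) respectively, consistent with the spline error theorems cited above.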
 
outrun
Posts: 4573
Joined: January 1st, 1970, 12:00 am

Re: Gap statistic problem

October 16th, 2017, 11:59 am

I know all that.

No, it's not about error; it's about not pulling claims like "I want smoothness" or "I want locality" out of thin air, but instead quantifying those choices based on your data.
 
outrun
Posts: 4573
Joined: January 1st, 1970, 12:00 am

Re: Gap statistic problem

October 16th, 2017, 12:01 pm

Do your splines, lines, etc. on this Ozone data and *explain* why your fancy spline model X is better than your less fancy spline model Y. What measure are you going to use to compare them? Don't say LSE. It's a good exercise, and also a well-known test dataset.

https://en.m.wikipedia.org/wiki/Bootstrap_aggregating
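To make the measure concrete, here's out-of-bag error in the spirit of that link, on a synthetic stand-in for the Ozone data (the data, the two candidate degrees, and `oob_mse` are all my assumptions): fit each candidate model on bootstrap resamples, score it only on the points left out of each resample, and compare.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 60))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, x.size)   # synthetic noisy data

def oob_mse(deg, B=200):
    """Out-of-bag error: fit on a bootstrap resample, score on the
    points that resample left out, average over B resamples."""
    errs = []
    for _ in range(B):
        idx = rng.integers(0, x.size, x.size)            # bootstrap indices
        oob = np.setdiff1d(np.arange(x.size), idx)       # left-out points
        if oob.size == 0:
            continue
        coef = np.polyfit(x[idx], y[idx], deg)
        errs.append(np.mean((np.polyval(coef, x[oob]) - y[oob]) ** 2))
    return float(np.mean(errs))

print(oob_mse(3), oob_mse(12))   # OOB error, not in-sample LSE, ranks the models
```

In-sample LSE always prefers the degree-12 model; the out-of-bag error typically prefers the simpler one, which is exactly the point.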
 
Traden4Alpha
Posts: 3300
Joined: September 20th, 2002, 8:30 pm

Re: Gap statistic problem

October 16th, 2017, 1:32 pm

It seems like there are three problems here:

1. Selecting the category of the approximating curve (e.g., a global polynomial, chain of splines, sinusoids, wavelets, multi-population model with independent curves, etc., etc.)

2. Estimating N: the appropriate number of terms, modes, segments, or model complexity within the selected category of model.

3. Estimating the "best fit" values of O(N) coefficients for the selected number of terms for the selected category of model.

The "overfitting" issue seems to be primarily caused by #2 in which insufficient N fails to capture valid structure in the data but excessive N is fitting to the noise. Yet it's coupled to the choices in problems #1 and #3. First, for example, it may take a large number of polynomial terms to get a good fit for a simple sinusoidal data set. Second, the coefficient estimator's treatment of noise or outliers can modulate the degree of overfitting if an excessive N is chosen (e.g., a robust outlier-ignoring estimator might avoid overfitting even if too high an N is chosen).


BTW, that "overfit" example graph may actually be the best-fit curve in some scenarios. For example, in many astronomical data-fitting problems there's strong evidence for (or expectation of) periodic phenomena which may be extremely sparsely sampled by the data (e.g., an object might orbit every few months but be sparsely sampled by telescope every few years). An astronomical dataset of only half a dozen samples might correctly resolve to an orbit estimate that has 20 "wiggles" during the sampling period. Solving problem #1 depends on metadata.
 
outrun
Posts: 4573
Joined: January 1st, 1970, 12:00 am

Re: Gap statistic problem

October 16th, 2017, 2:01 pm

Yes. Good analysis. I'm talking about general methods to make rational decisions on 1) and 2) based on the data you have.

Kata's algorithm is about 2): often a bigger N gives a better fit.
 
Traden4Alpha
Posts: 3300
Joined: September 20th, 2002, 8:30 pm

Re: Gap statistic problem

October 16th, 2017, 5:19 pm

Rational decisions about #1 seem really hard unless one has metadata. Although one can often eyeball the scattergram and spot structural aspects (e.g., kinks in the curve) that might be well-modelled or poorly-modelled by different methods, that seems like an ad hoc approach subject to all the horrible flaws in human thinking about patterns and randomness.

What's interesting about optimizing N is the in-sample vs. out-of-sample fit. The Gaussian Mixture Model graph you published earlier for fit = f(N) was very interesting in showing the difference between fit to the training data versus test data. Bigger N does give a better fit, but worse overfitting.

I'd think that the optimization of N is like an SNR problem except it's a structure-to-noise ratio in which one is trying to determine how much of the observed pattern is structure and how much is noise. In the case of estimating the number of clusters, the statistical properties of variability in each cluster (and variability in any data measurements) are noise-like effects that would surely affect estimates of structure.
 
Cuchulainn
Posts: 20256
Joined: July 16th, 2004, 7:38 am
Location: 20, 000

Re: Gap statistic problem

October 17th, 2017, 9:04 am

1. Selecting the category of the approximating curve (e.g., a global polynomial, chain of splines, sinusoids, wavelets, multi-population model with independent curves, etc., etc.)

So, what's the answer? What are the criteria that lead to choosing method 1 over method 2?

Let me try it this way, with a question: what is the rationale for using global polynomials in AI? I looked in a few books, but no reason was given. I suppose someone can come out and give an answer.

Global polynomials are not used in most applications of numerical analysis these days, AFAIK; they are a pre-1960s technique. In fairness, maybe AI is dealing with other issues.
That remark about using a 300-degree polynomial was hilarious. I hope you were joking. Was the example taken from Géron's book?
 
outrun
Posts: 4573
Joined: January 1st, 1970, 12:00 am

Re: Gap statistic problem

October 17th, 2017, 10:24 am

Global polynomials in AI?? There are some very good courses on Coursera if you're really interested in the foundations and advances. Also some books.

This is not about specific methods, but about knowing how to choose between them when all you have is data: a proof-of-the-pudding method, one where you are aware of the problems of overfitting. More complex models always fit the data better, but... but what? Why is that a problem?