September 14th, 2014, 2:11 pm
Quote (Originally posted by: Traden4Alpha):

Data mining has a bad name in the same way that eating pork has a bad name in some cultures. It may be true that eating pork and data mining have historically created trouble (parasites in the case of pork, overfitting in the case of data mining), but science has since found ways to avoid or prevent the associated problems. Historically speaking, data mining inevitably led to finding beautiful patterns that are not really there. It is only through an understanding of the structures latent even in random data, and the use of out-of-sample (OOS) testing, that data mining can be used safely. Yet many still hold superstitions about data mining and pork.

Of course, as soon as one uses the OOS data to adjust what one does (e.g., pick one model or parameter value over another), one needs a second level of OOS data to make sure the first use didn't create overfitting. But if one understands the principles, data mining is a powerful and valid tool.

Quote (Originally posted by: AnalyticalVega):

The overfitting problem has NOT been solved. Read Marcos's papers again. The problem of overfitting has been mitigated somewhat; there is no way to completely avoid overfitting when using time series data.

Quote (Originally posted by: Traden4Alpha):

It depends on what you mean by "avoid overfitting". On the one hand, every selection, rejection, or ranking process will have some overfitting issues, in which random events erroneously affect the results. On the other hand, if one knows that a system produces overfitting, and can estimate the expected amount of it, then one can deflate or debias the system. Knowing one has a problem is a powerful step toward avoiding it.

Ultimately, though, I somewhat agree that some overfitting cannot be avoided when one has a restricted sample of bounded-duration time series. If one does not know the underlying distribution of the randomness, one cannot estimate how that randomness might induce error or bias in the testing process. In the short term, financial time series certainly suffer from this problem. Yet in the long term, the market produces new out-of-sample data every day, which provides a powerful opportunity to test for and mitigate overfitting.

Quote (Originally posted by: AnalyticalVega):

It's much more complicated than that. New data may not be relevant to understanding current market behavior. Also, we never know the current underlying distribution. Perhaps we can predict the distribution of the distribution, with limited success.

Quote (Originally posted by: Traden4Alpha):

That's a very good point. Complications like that often reflect either an overly simplistic theory of the system (e.g., assuming IID Gaussians) or an overly narrow system boundary (e.g., ignoring exogenous influences on system parameters and structure).

But I think we need to be clear about the difference between a failure to correctly predict an outcome and a failure to predict the nature of one's errors in predicting an outcome. Overfitting is more a problem of the second kind (it creates both a bias and a false confidence), although the second kind of error certainly creates the first kind, too.

Avoiding overfitting may be less about modeling an exact distribution and more about modeling how the limited sample size and limited experimental control over the empirical data bound what we can predict about the future. Or, more practically speaking, learn how data mining violates key assumptions of statistical estimation processes and thereby induces overfitting.

Perhaps the more important issue is that statistical distributions may simply be a deeply wrong way to model social systems such as markets and economies. Coins and dice don't study their past flips or tosses, aren't buoyed or scarred by runs of good or bad outcomes, don't try to trick others to gain an advantage, and don't read the news to decide which way to flip or roll next time.
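The "beautiful patterns that are not really there" from the quoted exchange are easy to manufacture on demand. Below is a minimal sketch of that selection bias, assuming IID Gaussian noise (an illustration, not a market model); the strategy count and sample length are arbitrary choices for the example:

```python
import numpy as np

rng = np.random.default_rng(42)
n_strategies, n_days = 1000, 252

# Zero-edge daily returns: any apparent skill below is pure noise.
insample = rng.normal(0.0, 0.01, size=(n_strategies, n_days))
outsample = rng.normal(0.0, 0.01, size=(n_strategies, n_days))

def sharpe(r):
    # Annualized Sharpe ratio per row (strategy).
    return np.sqrt(252) * r.mean(axis=1) / r.std(axis=1, ddof=1)

best = np.argmax(sharpe(insample))  # "data mining": keep the in-sample winner
print(f"best in-sample Sharpe: {sharpe(insample)[best]:+.2f}")   # typically around +3
print(f"same strategy OOS:     {sharpe(outsample)[best]:+.2f}")  # typically near 0
```

The winner's in-sample Sharpe of roughly 3 comes from selection alone, and the expected size of that artifact grows with the number of trials. Estimating and subtracting it is, loosely, the deflation idea in the Marcos (López de Prado) papers mentioned above; and note that if you then use the OOS column to pick among survivors, you have restarted the same game one level up.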
It seems like you are suggesting stochastic calculus/martingale theory, which is even worse than time series analysis/empirical distributions.

Method / problem:

Martingale Theory - The market is not like a fair coin toss. The model is simply incorrect, and therefore useless.

Stochastic Calculus - Does not model markets with any accuracy. The models are incorrect and useless.

Time Series Analysis - Overfitting. We don't know the underlying distribution of any current or future market, so we end up guessing what the distribution is. Estimation risk (see the sketch after this list).

GARCH / Jump Diffusion Models - More complex models built on time series analysis and stochastic calculus. The problem is that we don't know the current or future distributions of the jumps/regime changes.

Dynamic Models / Chaos Theory - Sample-space problems. There is not enough data to accurately classify market behaviors.

Fractal Analysis - The market is a robust fractal, but we have to impose a specific fractal count and pattern in order to trade, and that count/pattern may be wrong. So once again the imposed fractal model is wrong, while the robust fractal model is correct but useless, because it is not specific.

Summary: Most methods don't work because they rest on incorrect models, assume the wrong distributions, or require more relevant market-event data than exists.
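To make the estimation-risk entry concrete, here is a minimal sketch of fitting the wrong distribution to limited data. The Student-t(3) "truth" and the sample size are illustrative assumptions, not claims about any real market:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Pretend the unknown "true" market is Student-t with 3 degrees of freedom:
# heavy tails, finite variance. In practice nobody hands us this truth.
true_df = 3
sample = stats.t.rvs(true_df, size=500, random_state=rng)

# Fit the "wrong" model: a Gaussian, estimated from sample moments.
mu, sigma = sample.mean(), sample.std(ddof=1)

threshold = 8.0                                # a large move, in return units
p_true = stats.t.sf(threshold, true_df)        # what the market actually does
p_model = stats.norm.sf(threshold, mu, sigma)  # what the fitted model believes

print(f"true  P(X > {threshold}): {p_true:.1e}")   # ~ 2e-3
print(f"model P(X > {threshold}): {p_model:.1e}")  # usually orders of magnitude smaller
```

The fitted parameters look precise, but the tail probability they imply is wildly wrong. Swap in jump intensities, regime-change frequencies, or fractal counts from the other entries above and the same mechanism applies.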