September 14th, 2014, 2:11 pm
Quote (Originally posted by: Traden4Alpha):

Data mining has a bad name in the same way that eating pork has a bad name in some cultures. It may be true that eating pork and data mining have historically created trouble (parasites in the case of pork, overfitting in the case of data mining), but science has since found ways to avoid or prevent the associated problems. Historically speaking, data mining inevitably led to finding beautiful patterns that are not really there. It is only through an understanding of the structures latent even in random data, and the use of out-of-sample (OOS) testing, that data mining can be used safely. Yet many still hold superstitions about data mining and pork.

Of course, as soon as one uses the OOS data to adjust what one does (e.g., pick one model or parameter value over another), one needs a second level of OOS data to make sure the first use didn't create overfitting. But if one understands the principles, data mining is a powerful and valid tool.

Quote (Originally posted by: AnalyticalVega):

The overfitting problem has NOT been solved. Read Marcos's papers again. The problem of overfitting has been mitigated somewhat; there is no way to completely avoid overfitting when using time series data.

Quote (Originally posted by: Traden4Alpha):

It depends on what you mean by "avoid overfitting". On the one hand, every selection, rejection, or ranking process will have some overfitting issues, in which random events erroneously affect the results. On the other hand, if one knows that a system produces overfitting, and can estimate the expected amount of it, then one can deflate or debias the system. Knowing one has a problem is a powerful step toward avoiding it.

Ultimately, though, I somewhat agree that some overfitting cannot be avoided when one has a restricted sample of bounded-duration time series. If one does not know the underlying distribution of the randomness, one cannot estimate how that randomness might induce error or bias in the testing process. In the short term, financial time series certainly suffer from this problem. Yet in the long term, the market produces new out-of-sample data every day, which provides a powerful opportunity to test for and mitigate overfitting.

Quote (Originally posted by: AnalyticalVega):

It's much more complicated than that. New data may not be relevant to understanding current market behavior. Also, we never know the current underlying distribution. Perhaps we can predict the distribution of the distribution, with limited success.

Quote (Originally posted by: Traden4Alpha):

That's a very good point. Complications like that often reflect either an overly simplistic theory of the system (e.g., assuming IID Gaussians) or an overly narrow system boundary (e.g., ignoring exogenous influences on system parameters and structure).

But I think we need to be clear about the difference between a failure to correctly predict an outcome and a failure to predict the nature of one's errors in predicting an outcome. Overfitting is more a problem of the second kind (it creates both a bias and a false confidence), although the second kind of error certainly creates the first kind, too.

Avoiding overfitting may be less about modeling an exact distribution and more about modeling how the limited sample size and limited experimental control over the empirical data bound what we can predict about the future. Or, more practically speaking, learn how data mining violates key assumptions of statistical estimation processes and thereby induces overfitting.

Perhaps the more important issue is that statistical distributions may simply be a deeply wrong way to model social systems such as markets and economies. Coins and dice don't study their past flips or tosses, aren't buoyed or scarred by runs of good or bad outcomes, don't try to trick others to gain an advantage, and don't read the news to decide which way to flip or roll next time.
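The "beautiful patterns that are not really there" from the quoted exchange are easy to manufacture on demand. Below is a minimal sketch of that selection bias, assuming IID Gaussian noise (an illustration, not a market model); the strategy count and sample length are arbitrary choices for the example:

```python
import numpy as np

rng = np.random.default_rng(42)
n_strategies, n_days = 1000, 252

# Zero-edge daily returns: any apparent skill below is pure noise.
insample = rng.normal(0.0, 0.01, size=(n_strategies, n_days))
outsample = rng.normal(0.0, 0.01, size=(n_strategies, n_days))

def sharpe(r):
    # Annualized Sharpe ratio per row (strategy).
    return np.sqrt(252) * r.mean(axis=1) / r.std(axis=1, ddof=1)

best = np.argmax(sharpe(insample))  # "data mining": keep the in-sample winner
print(f"best in-sample Sharpe: {sharpe(insample)[best]:+.2f}")   # typically around +3
print(f"same strategy OOS:     {sharpe(outsample)[best]:+.2f}")  # typically near 0
```

The winner's in-sample Sharpe of roughly 3 comes from selection alone, and the expected size of that artifact grows with the number of trials. Estimating and subtracting it is, loosely, the deflation idea in the Marcos (López de Prado) papers mentioned above; and note that if you then use the OOS column to pick among survivors, you have restarted the same game one level up.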
It seems like you are suggesting stochastic calculus/martingale theory, which is even worse than time series analysis/empirical distributions.

Method / problem:

Martingale Theory - The market is not like a fair coin toss. The model is simply incorrect, and therefore useless.

Stochastic Calculus - Does not model markets with any accuracy. The models are incorrect and useless.

Time Series Analysis - Overfitting. We don't know the underlying distribution of any current or future market, so we end up guessing what the distribution is. Estimation risk (see the sketch after this list).

GARCH / Jump Diffusion Models - More complex models built on time series analysis and stochastic calculus. The problem is that we don't know the current or future distributions of the jumps/regime changes.

Dynamic Models / Chaos Theory - Sample-space problems. There is not enough data to accurately classify market behaviors.

Fractal Analysis - The market is a robust fractal, but we have to impose a specific fractal count and pattern in order to trade, and that count/pattern may be wrong. So once again the imposed fractal model is wrong, while the robust fractal model is correct but useless, because it is not specific.

Summary: Most methods don't work because they rest on incorrect models, assume the wrong distributions, or require more relevant market-event data than exists.
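To make the estimation-risk entry concrete, here is a minimal sketch of fitting the wrong distribution to limited data. The Student-t(3) "truth" and the sample size are illustrative assumptions, not claims about any real market:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Pretend the unknown "true" market is Student-t with 3 degrees of freedom:
# heavy tails, finite variance. In practice nobody hands us this truth.
true_df = 3
sample = stats.t.rvs(true_df, size=500, random_state=rng)

# Fit the "wrong" model: a Gaussian, estimated from sample moments.
mu, sigma = sample.mean(), sample.std(ddof=1)

threshold = 8.0                                # a large move, in return units
p_true = stats.t.sf(threshold, true_df)        # what the market actually does
p_model = stats.norm.sf(threshold, mu, sigma)  # what the fitted model believes

print(f"true  P(X > {threshold}): {p_true:.1e}")   # ~ 2e-3
print(f"model P(X > {threshold}): {p_model:.1e}")  # usually orders of magnitude smaller
```

The fitted parameters look precise, but the tail probability they imply is wildly wrong. Swap in jump intensities, regime-change frequencies, or fractal counts from the other entries above and the same mechanism applies.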