high-frequency lead/lag relationships

jbraswell314 · September 13th, 2007, 9:12 pm

Hi, all. I want to investigate whether or not certain securities have a lead-lag relationship on a small time-frame. My plan is to get the bid/ask midpoint every 100 ms or so for each series, difference those midpoints to get the price changes, and then regress the resulting series against each other. Then, I was going to lag one series and re-run the regression against the other series for a number of lags. My thinking is that if I get a better correlation coefficient on one of the lags, then it's good reason to believe there is a tradable lead-lag relationship between the two products.

Any reason why this shouldn't work? Any common ways to sharpen and improve the results?

Thanks
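A minimal sketch of the lag scan described above, in plain Python. Everything here is an assumption for illustration: the midpoints are synthetic random walks sampled on a regular grid, the second series is built to follow the first by 3 samples, and the helper names (`pearson`, `diff`, `lagged_corr`) are hypothetical. With real data the two midpoint series would first have to be sampled on the same 100 ms grid.

```python
import random
from itertools import accumulate

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5

def diff(mid):
    """First differences of a midpoint series (the price changes)."""
    return [b - a for a, b in zip(mid, mid[1:])]

def lagged_corr(dx, dy, lag):
    """Correlation of dx[t] with dy[t + lag]; lag > 0 means x leads y."""
    if lag >= 0:
        return pearson(dx[:len(dx) - lag], dy[lag:])
    return pearson(dx[-lag:], dy[:len(dy) + lag])

# Synthetic data (assumption): y's price change echoes x's change 3 samples later.
rng = random.Random(0)
n, true_lag = 5000, 3
dx_true = [rng.gauss(0, 1) for _ in range(n)]
dy_true = []
for t in range(n):
    echo = 0.5 * dx_true[t - true_lag] if t >= true_lag else 0.0
    dy_true.append(echo + rng.gauss(0, 1))

# Build midpoint series, then difference them as in the plan above.
mid_x = list(accumulate(dx_true, initial=100.0))
mid_y = list(accumulate(dy_true, initial=100.0))
dx, dy = diff(mid_x), diff(mid_y)

# Scan a window of lags and pick the one with the largest |correlation|.
scan = {lag: lagged_corr(dx, dy, lag) for lag in range(-10, 11)}
best_lag = max(scan, key=lambda k: abs(scan[k]))
print(best_lag, round(scan[best_lag], 3))
```

Picking the best lag out of many is itself a selection step, so the winning coefficient should be judged against what pure chance produces over the same set of lags.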

Traden4Alpha · September 14th, 2007, 12:07 pm

1. Watch out for overfitting. In any large dataset, there will be patterns that occur by chance. Make sure you retest any found relationship against out-of-sample data (preferably new data that you've never tested before). You might also want to construct a randomized dataset, cross correlate that with your lead-lag system and study the distribution of the correlation coefficients. Testing on randomized data will give you a feel for when a correlation coefficient is abnormally good.

2. Watch out for transaction costs. Even if the correlation coefficient is great, the amount of explained price variance (in currency units) may be too small to cover the transaction costs of trading. Paradoxically, you may get higher profits by trading a lower correlation coefficient setup on a higher variance dataset than trading a higher correlation coefficient setup on a lower variance dataset.

3. Watch out for latency and jitter issues. You may find that lag = X works best, but in implementing a trading system you have some unavoidable lags in getting the data, cleaning it, processing it, generating orders, submitting them, waiting for them to appear on the exchange, and waiting for them to execute. These delays will need to be accommodated by the system. Unfortunately, some of these delays may have variations which will mean you are really trading a system with lag +/- delta, where delta is a highly skewed random variable.
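To put rough numbers on point 2: in a one-variable regression, the standard deviation of the fitted (explained) part of the price change is |rho| times the standard deviation of the series being predicted. The sketch below compares that figure with a trading cost for two setups. All numbers, the cost figure, and the name `predictable_move` are hypothetical, and the explained move is only a crude proxy for capturable profit, not an expected P&L.

```python
def predictable_move(rho, sigma):
    """Std. dev. of the fitted part of a one-variable regression: |rho| * sigma."""
    return abs(rho) * sigma

# Hypothetical round-trip cost per unit traded, in ticks.
cost = 0.5

# (correlation coefficient, std. dev. of the predicted series in ticks)
setups = {
    "high corr, low variance": (0.60, 0.5),
    "low corr, high variance": (0.30, 3.0),
}

# Crude edge: explained move minus cost.
edge = {name: predictable_move(rho, sigma) - cost
        for name, (rho, sigma) in setups.items()}

for name, value in edge.items():
    print(f"{name}: explained move minus cost = {value:+.2f} ticks")
```

With these made-up inputs the higher-correlation setup does not cover its costs, while the lower-correlation setup on the more volatile series does.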

jbraswell314 · September 18th, 2007, 4:12 pm

OK, thanks for the tips. With regard to 1, what exactly do you mean by the "distribution of correlation coefficients"? Do you mean the coefficients for all of the different lag times?

Thanks again.

Traden4Alpha · September 18th, 2007, 6:18 pm

What I mean is that with random data (created to have the "same" distribution of log returns as the true data but to have no correlation at any lag), the correlation coefficients across lags will follow some distribution. In particular, you want to understand the distribution of extrema, because that is how you plan to pick relevant values of lag, and random chance will lead to extreme values on occasion. Thus you want to understand: if you have data that is truly patternless at all lags, what is the expected largest-magnitude correlation coefficient that might be seen? How often can random variations in the data push the largest-magnitude correlation coefficient above some value?

The point is to construct a test in which you compare the observed correlation coefficients (on real data) against those expected from random data.
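One way to build such a test, sketched in plain Python: shuffle one of the return series in time, which keeps its distribution of returns but destroys any relationship with the other series at every lag, then record the largest |correlation| found across the lag window. Repeating this many times gives an empirical distribution of extrema to compare the real result against. Shuffling is only one possible randomization (it also removes any autocorrelation in the shuffled series), and the data, trial count, lag window, and function names below are assumptions for illustration.

```python
import random

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5

def max_abs_corr(dx, dy, lags):
    """Largest |correlation| of dx[t] with dy[t + lag] over the given lags."""
    best = 0.0
    for lag in lags:
        if lag >= 0:
            r = pearson(dx[:len(dx) - lag], dy[lag:])
        else:
            r = pearson(dx[-lag:], dy[:len(dy) + lag])
        best = max(best, abs(r))
    return best

def null_extrema(dx, dy, lags, n_trials, rng):
    """Sorted extrema of |correlation| when dy is shuffled in time."""
    out = []
    for _ in range(n_trials):
        shuffled = dy[:]
        rng.shuffle(shuffled)  # same returns, random order
        out.append(max_abs_corr(dx, shuffled, lags))
    return sorted(out)

# Placeholder data (assumption): two independent return series.
rng = random.Random(1)
n = 500
dx = [rng.gauss(0, 1) for _ in range(n)]
dy = [rng.gauss(0, 1) for _ in range(n)]
lags = range(-10, 11)

null = null_extrema(dx, dy, lags, n_trials=200, rng=rng)
threshold = null[int(0.95 * len(null))]   # empirical 95th percentile of the extrema
observed = max_abs_corr(dx, dy, lags)
p_value = sum(v >= observed for v in null) / len(null)
print(round(threshold, 3), round(observed, 3), p_value)
```

An observed maximum that sits well inside the shuffled distribution is what chance alone would be expected to produce over that many lags. Only a value far out in the tail is worth carrying forward to out-of-sample testing.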