March 27th, 2008, 4:28 pm
It really depends on how you want to analyze the data.

HDF5 and CDF/NetCDF are good formats for storing large amounts of data portably, and you can write code to pull the data out quickly. HDF5 has the advantage that you can access the data through multiple threads at the same time. Personally, I find the abstraction layer in HDF5 quite hard to work with (using the straight "C" interface), and it isn't great for the kind of ad-hoc queries SQL handles well. On the other hand, you can load the files into Mathematica, Matlab, or Python quite easily. I would choose this option if you want to access the data in a fast, matrix-like way, or if you want to transfer and analyze the data across disparate machines portably.

Another alternative is Berkeley DB. This will let you index the data if you need fast random access to individual records. You'll still have to code up your analysis tools, but you'll be able to jump to individual records quickly.

Finally, there's a full-blown SQL database. This takes up the most space and is the slowest option for large-scale data access, but it's the easiest for ad-hoc queries.

Rough sketches of each option follow below.

Cheers, Brett
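For concreteness, here's roughly what the HDF5 route looks like from Python. This is only a minimal sketch assuming the h5py bindings (the post above only discusses the C interface, and PyTables is another common choice); the file name, dataset name, and sizes are made up.

    # Minimal HDF5 sketch using the h5py bindings (an assumption; the
    # straight C interface does the same thing, just more verbosely).
    # File/dataset names and sizes are made up for illustration.
    import numpy as np
    import h5py

    # Write a large array once...
    with h5py.File("samples.h5", "w") as f:
        data = np.random.rand(1000000)
        f.create_dataset("samples", data=data, compression="gzip")

    # ...then slice into it later. Only the requested slice is read
    # from disk, which is what makes the matrix-like access fast.
    with h5py.File("samples.h5", "r") as f:
        chunk = f["samples"][10000:20000]
        print(chunk.mean())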
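The Berkeley DB option, again sketched from Python via the bsddb3 bindings (another assumption on my part; the C API is equivalent). The point is the keyed B-tree lookup: you index on something like a timestamp and jump straight to a record. The key layout and payloads here are hypothetical.

    # Minimal Berkeley DB sketch via the bsddb3 bindings (an assumption).
    # Keys and values are byte strings; the key layout is hypothetical.
    import bsddb3

    db = bsddb3.btopen("records.bdb", "c")  # "c" = create if missing

    # Fixed-width timestamp keys keep the B-tree sort order sensible.
    db[b"20080327T162800"] = b"record payload 1"
    db[b"20080327T162801"] = b"record payload 2"

    # Jump straight to a record (or the nearest following key): this
    # is the fast random access to individual records.
    key, value = db.set_location(b"20080327T162800")
    print(key, value)

    db.close()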
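And the SQL route. I'm using Python's sqlite3 module here purely as a stand-in for a full server, since it ships with the standard library; the schema is made up, but it shows the ad-hoc query style the other two options make you hand-roll.

    # Minimal SQL sketch using sqlite3 as a stand-in for a full server.
    # Table and column names are illustrative only.
    import sqlite3

    conn = sqlite3.connect("records.db")
    conn.execute("CREATE TABLE IF NOT EXISTS samples (ts TEXT, value REAL)")
    conn.executemany("INSERT INTO samples VALUES (?, ?)",
                     [("20080327T162800", 101.25),
                      ("20080327T162801", 101.26)])
    conn.commit()

    # The payoff: ad-hoc questions with no custom scan code.
    query = "SELECT COUNT(*), AVG(value) FROM samples WHERE value > 101.2"
    for row in conn.execute(query):
        print(row)

    conn.close()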