March 27th, 2008, 4:28 pm
It really depends on how you want to analyze the data.

HDF5 and CDF/NetCDF are good formats for storing large amounts of data portably, and you can write code to pull the data out quickly. HDF5 has the advantage that you can access the data through multiple threads at the same time. Personally, I find the abstraction layer in HDF5 quite hard to work with (using the straight "C" interface), and it isn't great for the kind of ad-hoc queries SQL handles well. On the other hand, you can load the files into Mathematica, Matlab, or Python quite easily. I would choose this option if you want to access the data in a fast, matrix-like way, or if you want to transfer and analyze the data across disparate machines portably.

Another alternative is Berkeley DB. This will let you index the data if you need fast random access to individual records. You'll still have to code up your analysis tools, but you'll be able to jump to individual records quickly.

Finally, there's a full-blown SQL database. This takes up the most space and is the slowest option for large-scale data access, but it's the easiest for ad-hoc queries.

Rough sketches of each option follow below.

Cheers, Brett
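For concreteness, here's roughly what the HDF5 route looks like from Python. This is only a minimal sketch assuming the h5py bindings (the post above only discusses the C interface, and PyTables is another common choice); the file name, dataset name, and sizes are made up.

    # Minimal HDF5 sketch using the h5py bindings (an assumption; the
    # straight C interface does the same thing, just more verbosely).
    # File/dataset names and sizes are made up for illustration.
    import numpy as np
    import h5py

    # Write a large array once...
    with h5py.File("samples.h5", "w") as f:
        data = np.random.rand(1000000)
        f.create_dataset("samples", data=data, compression="gzip")

    # ...then slice into it later. Only the requested slice is read
    # from disk, which is what makes the matrix-like access fast.
    with h5py.File("samples.h5", "r") as f:
        chunk = f["samples"][10000:20000]
        print(chunk.mean())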
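The Berkeley DB option, again sketched from Python via the bsddb3 bindings (another assumption on my part; the C API is equivalent). The point is the keyed B-tree lookup: you index on something like a timestamp and jump straight to a record. The key layout and payloads here are hypothetical.

    # Minimal Berkeley DB sketch via the bsddb3 bindings (an assumption).
    # Keys and values are byte strings; the key layout is hypothetical.
    import bsddb3

    db = bsddb3.btopen("records.bdb", "c")  # "c" = create if missing

    # Fixed-width timestamp keys keep the B-tree sort order sensible.
    db[b"20080327T162800"] = b"record payload 1"
    db[b"20080327T162801"] = b"record payload 2"

    # Jump straight to a record (or the nearest following key): this
    # is the fast random access to individual records.
    key, value = db.set_location(b"20080327T162800")
    print(key, value)

    db.close()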
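And the SQL route. I'm using Python's sqlite3 module here purely as a stand-in for a full server, since it ships with the standard library; the schema is made up, but it shows the ad-hoc query style the other two options make you hand-roll.

    # Minimal SQL sketch using sqlite3 as a stand-in for a full server.
    # Table and column names are illustrative only.
    import sqlite3

    conn = sqlite3.connect("records.db")
    conn.execute("CREATE TABLE IF NOT EXISTS samples (ts TEXT, value REAL)")
    conn.executemany("INSERT INTO samples VALUES (?, ?)",
                     [("20080327T162800", 101.25),
                      ("20080327T162801", 101.26)])
    conn.commit()

    # The payoff: ad-hoc questions with no custom scan code.
    query = "SELECT COUNT(*), AVG(value) FROM samples WHERE value > 101.2"
    for row in conn.execute(query):
        print(row)

    conn.close()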