
 
tibbar
Topic Author
Posts: 10
Joined: November 7th, 2005, 9:21 pm

large data files - c++

June 1st, 2012, 8:29 pm

Hi,

Can anyone suggest the fastest way of storing / retrieving arrays from a data file in C++? I need to store an M x N grid where each entry is a fixed-length array of doubles. Total data file size will be around 1 GB, and I need the total time to access and load an array to be no more than 0.1 ms, and hopefully less... Is this feasible, or is hard drive access time going to be the constraint here (SSDs are faster, I guess)?

Feels like the best approach is a binary file, using a memory map, with an efficient lookup table at the beginning for rows and byte offsets for columns. Are there some pre-built classes out there which will make this easy? I looked at the serialization classes in Boost, but these don't seem to fit the bill.

Many thanks.
 
CluelessCpp
Posts: 0
Joined: April 7th, 2012, 11:45 am

large data files - c++

June 1st, 2012, 9:16 pm

If the entries are fixed length, you don't even need the lookup table - you can just calculate the offset in the file. If the structure is more complex, you might want to look at Berkeley DB, HDF or SQLite.

You won't be able to achieve 0.1 ms with normal hard disks (due to seek time and rotational latency), but SSDs might be fine (as long as you only need a single read to get the data).
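For the flat-binary-file case, here is a minimal sketch of that offset calculation in C++. The row-major layout and the parameters N (columns) and K (doubles per cell) are assumptions about how the file was written, not something fixed by the thread:

    #include <cstdint>
    #include <fstream>
    #include <vector>

    // Sketch: fixed-size cells stored row-major in a flat binary file.
    // N = number of columns, K = doubles per cell (assumed layout).
    std::vector<double> read_cell(std::ifstream& file,
                                  std::size_t row, std::size_t col,
                                  std::size_t N, std::size_t K)
    {
        // Byte offset of cell (row, col): each cell holds K doubles.
        const std::uint64_t offset =
            static_cast<std::uint64_t>(row * N + col) * K * sizeof(double);

        std::vector<double> cell(K);
        file.seekg(static_cast<std::streamoff>(offset));
        file.read(reinterpret_cast<char*>(cell.data()), K * sizeof(double));
        return cell;
    }

    // Usage: std::ifstream f("grid.bin", std::ios::binary);
    //        auto v = read_cell(f, 3, 7, N, K);

Each lookup is one seek plus one read of K doubles.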
 
tibbar
Topic Author
Posts: 10
Joined: November 7th, 2005, 9:21 pm

large data files - c++

June 2nd, 2012, 8:36 am

Thanks for the response. Would there be any difference in speed between pulling an array from a 1GB or 100GB file using a simple offset method?
 
bojan
Posts: 0
Joined: August 8th, 2008, 5:35 am

large data files - c++

June 2nd, 2012, 8:49 am

The HDF5 library is designed pretty much exactly for this... It has options to memmap files when opening.
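To make that concrete, here is a sketch using the HDF5 C API (callable from C++) to read the K doubles at one grid cell via a hyperslab selection. The dataset name "/grid" and its M x N x K layout are assumptions for illustration:

    #include <hdf5.h>

    // Sketch: read the K doubles at cell (row, col) of an assumed
    // M x N x K dataset named "/grid". Link against libhdf5.
    herr_t read_cell(const char* path, hsize_t row, hsize_t col,
                     hsize_t K, double* out)
    {
        hid_t file   = H5Fopen(path, H5F_ACC_RDONLY, H5P_DEFAULT);
        hid_t dset   = H5Dopen2(file, "/grid", H5P_DEFAULT);
        hid_t fspace = H5Dget_space(dset);

        // Select the 1 x 1 x K hyperslab for this cell in the file.
        hsize_t start[3] = { row, col, 0 };
        hsize_t count[3] = { 1, 1, K };
        H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);

        // Memory dataspace: a flat array of K doubles.
        hid_t mspace  = H5Screate_simple(1, &K, NULL);
        herr_t status = H5Dread(dset, H5T_NATIVE_DOUBLE, mspace, fspace,
                                H5P_DEFAULT, out);

        H5Sclose(mspace);
        H5Sclose(fspace);
        H5Dclose(dset);
        H5Fclose(file);
        return status;
    }

In practice you would open the file and dataset once and keep the handles around, rather than reopening them on every lookup.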
 
Traden4Alpha
Posts: 3300
Joined: September 20th, 2002, 8:30 pm

large data files - c++

June 2nd, 2012, 11:02 am

Originally posted by tibbar: "Total data file size will be around 1 GB and I need the total time to access and load an array to be no more than 0.1 ms and hopefully less... is this feasible or is the hard drive access time going to be the constraint here (SSDs are faster I guess)."

The speed you want is several orders of magnitude faster than current tech can provide.

The fastest SSDs max out at only 28 GB/sec (and that takes 6 InfiniBand ports), so it will take at least 36 ms to load 1 GB. A more normal SATA-based SSD would need about 2000 ms to 5000 ms for a 1 GB file.

And even if you store the file in RAM, the memory bandwidth of an Intel Core i7 CPU supports only 30-50 GB/sec depending on clock and memory speed, so simply bringing in each data value for processing will take 20 ms.

If you can do it all in parallel in strips, neighborhoods, or a hierarchical decomposition of the M x N grid, then a 0.1 ms processing time might be possible, but it would need to be massively parallel on the storage side.
 
tibbar
Topic Author
Posts: 10
Joined: November 7th, 2005, 9:21 pm

large data files - c++

June 2nd, 2012, 1:00 pm

But I don't want to load in the whole file - I simply want to pull out data at specified offsets from the start. Surely using a memory-mapped file this is possible without loading it all into RAM first?
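That is exactly what a memory map gives you: only the pages you actually touch are read from disk. A rough sketch using POSIX mmap (Windows would use CreateFileMapping / MapViewOfFile); the row-major layout and the N and K parameters are the same assumptions as before:

    #include <cstddef>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    // Sketch: map the file read-only. Only the pages actually touched
    // are paged in, so pulling one cell does not load the whole 1 GB.
    struct MappedGrid {
        const double* data  = nullptr;
        std::size_t   bytes = 0;

        bool open(const char* path) {
            int fd = ::open(path, O_RDONLY);
            if (fd < 0) return false;
            struct stat st;
            if (fstat(fd, &st) != 0) { ::close(fd); return false; }
            bytes = static_cast<std::size_t>(st.st_size);
            void* p = mmap(nullptr, bytes, PROT_READ, MAP_SHARED, fd, 0);
            ::close(fd);   // the mapping stays valid after closing the fd
            if (p == MAP_FAILED) return false;
            data = static_cast<const double*>(p);
            return true;
        }

        // Pointer to the K doubles at cell (row, col) of an M x N grid.
        const double* cell(std::size_t row, std::size_t col,
                           std::size_t N, std::size_t K) const {
            return data + (row * N + col) * K;
        }

        ~MappedGrid() { if (data) munmap(const_cast<double*>(data), bytes); }
    };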
 
Hansi
Posts: 41
Joined: January 25th, 2010, 11:47 am

large data files - c++

June 2nd, 2012, 1:31 pm

Yes, the smaller the subset the faster it will be, given that the HDD isn't massively fragmented or something like that.

But if the whole dataset is only 1 GB, why not just keep it in memory and use an in-memory database?

There is a thread on memory mapping in R here, and you can see there is a massive speedup compared to loading everything into memory first, but once the data is in memory it's quicker than loading the MM files: http://www.wilmott.com/messageview.cfm? ... TABLE=

Note the timing is obviously much slower since R is uber slow.
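If the data really does fit in RAM, the simplest version of that idea is to read the whole file into one vector at startup and index it directly. A sketch, again assuming the flat row-major layout from earlier in the thread:

    #include <fstream>
    #include <vector>

    // Sketch: load the entire file once; every later lookup is just
    // a memory access at index (row * N + col) * K.
    std::vector<double> load_all(const char* path)
    {
        std::ifstream in(path, std::ios::binary | std::ios::ate);
        const std::streamsize bytes = in.tellg();
        std::vector<double> grid(static_cast<std::size_t>(bytes) / sizeof(double));
        in.seekg(0);
        in.read(reinterpret_cast<char*>(grid.data()), bytes);
        return grid;
    }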
 
Polter
Posts: 1
Joined: April 29th, 2008, 4:55 pm

large data files - c++

June 2nd, 2012, 1:32 pm

If you want support for parallel disks, STXXL might be of interest: http://stxxl.sourceforge.net/