
 
msperlin
Topic Author
Posts: 5
Joined: July 10th, 2006, 6:21 pm

Big and Nasty 1gb .txt file - Need some advice

October 16th, 2007, 5:20 pm

Hi,

I'm extracting data from 1 GB files (8 million lines) where the information is intraday trading data (bid, ask, volume, etc. - 16 columns) for a lot of assets and lots of days. This is how I've done it so far:

- get what I want (a ticker symbol) with a C++ program by copying the desired parts (lines) to a txt file
- load the txt into matlab and process it (delete invalid entries, turn into doubles, etc.)

This works, but the first step takes 20 minutes. Is there any way I can speed this up?

Thanks.
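
(For reference, a minimal sketch of the C++ filtering step described above; the file names, the example ticker, and the semicolon-delimited, ticker-first layout are assumptions, not taken from the post.)

    // Sketch only: copy the lines whose first field matches the wanted ticker.
    #include <fstream>
    #include <string>

    int main()
    {
        std::ifstream in("trades_1gb.txt");      // hypothetical input name
        std::ofstream out("PETR4_only.txt");     // hypothetical output name
        const std::string ticker = "PETR4";      // placeholder ticker
        std::string line;
        while (std::getline(in, line))
        {
            // keep the line only if it starts with "<ticker>;"
            if (line.compare(0, ticker.size() + 1, ticker + ";") == 0)
                out << line << '\n';
        }
        return 0;
    }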
 
msperlin
Topic Author
Posts: 5
Joined: July 10th, 2006, 6:21 pm

Big and Nasty 1gb .txt file - Need some advice

October 16th, 2007, 5:54 pm

Also, I've written a function in the class that checks the unique tickers in the file (the different assets). It works as it should for 5 MB files, but when I try the 1 GB file, it just reaches a point and collapses with the Windows "illegal operation" message (or something like it). I'm saving the unique tickers as a vector<string> and it stops at count=250, which is not much.

Is this a memory issue? Does anyone have any idea of how to fix it?

Thanks again.
 
kjeld
Posts: 0
Joined: October 8th, 2007, 11:54 am

Big and Nasty 1gb .txt file - Need some advice

October 16th, 2007, 6:46 pm

Your approach sounds like a good one; with not too much parsing I would expect at least 10 MB/s of throughput achievable (so 100 seconds for the whole file).

The vector with tickers should in principle be no problem regarding memory. However, what is the intention of that vector? Do you use it on each new line to check if it contains a new ticker?

If you like I could take a look at the code; no need to send the 1 GB over, since we have similar files around.

-Kjeld
 
msperlin
Topic Author
Posts: 5
Joined: July 10th, 2006, 6:21 pm

Big and Nasty 1gb .txt file - Need some advice

October 16th, 2007, 7:00 pm

Quote, originally posted by kjeld:
"Your approach sounds like a good one; with not too much parsing I would expect at least 10 MB/s of throughput achievable (so 100 seconds for the whole file). The vector with tickers should in principle be no problem regarding memory. However, what is the intention of that vector? Do you use it on each new line to check if it contains a new ticker? If you like I could take a look at the code; no need to send the 1 GB over, since we have similar files around. -Kjeld"

Wow, 100 seconds? I'm getting 20 minutes here... Am I doing something wrong? (I'll show the code below.)

The vector is just for saving the unique tickers and testing uniqueness on each iteration. Yes, it is used for each line, but it only grows as new unique tickers turn up.

Here's the code for getting the size (number of rows and columns) of the file. The unique-tickers function is a little more complex, but the structure is very similar to this one. Btw, I'm using MV express.

    int bigTextFile::getSize()
    {
        ifstream ifs(fileName);
        if (!ifs)
        {
            cout << "Something is wrong with " << fileName << endl;
            system("PAUSE");
            terminate();
        }
        else
        {
            cout << "File loaded successfully!" << endl;
        }

        cout << "Reading " << fileName << " contents. Please hold.." << endl << endl;

        int countnr = 0; // number of rows (lines)
        int countnc = 1; // number of columns

        while (!ifs.eof())
        {
            string line;
            getline(ifs, line);

            if (countnr == 0) // for the first iteration, get the number of columns
            {
                int len = line.length();
                char *myptr2 = new char[len];
                const char *linTok;
                line.copy(myptr2, len, 0);
                myptr2[len] = 0;

                linTok = strtok(myptr2, ";");
                if (linTok == NULL) break;

                while (linTok != NULL)
                {
                    linTok = strtok(NULL, ";");
                    if (linTok == NULL)
                    {
                        break;
                    }
                    countnc++; // count the number of columns
                }
            }
            countnr++; // count the number of rows
        }

        nrow = countnr;
        ncol = countnc;
        ifs.close();
        return 1;
    }
Last edited by msperlin on October 15th, 2007, 10:00 pm, edited 1 time in total.
 
DominicConnor
Posts: 41
Joined: July 14th, 2002, 3:00 am

Big and Nasty 1gb .txt file - Need some advice

October 16th, 2007, 8:56 pm

Ah, this is the sort of problem that has made this topic (in)famous.

I think your problem would go away (or at least become more interesting) if you used an STL set.

First, it would probably kill off the fact that somehow you are running out of system memory, and secondly I'd bet beer your algorithm for de-duping tickers is order N^2.
Last edited by DominicConnor on October 15th, 2007, 10:00 pm, edited 1 time in total.
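
(A minimal sketch of the std::set idea; the file name and the semicolon-delimited, ticker-first layout are assumptions.)

    // Sketch: collect unique tickers in one pass with std::set.
    #include <fstream>
    #include <iostream>
    #include <set>
    #include <string>

    int main()
    {
        std::ifstream ifs("trades_1gb.txt");           // hypothetical file name
        std::set<std::string> tickers;                 // de-duplicates automatically
        std::string line;
        while (std::getline(ifs, line))
        {
            std::string::size_type sep = line.find(';');
            if (sep != std::string::npos)
                tickers.insert(line.substr(0, sep));   // insert ignores duplicates
        }
        std::cout << "Unique tickers: " << tickers.size() << std::endl;
        return 0;
    }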
 
DominicConnor
Posts: 41
Joined: July 14th, 2002, 3:00 am

Big and Nasty 1gb .txt file - Need some advice

October 16th, 2007, 9:01 pm

OK, so it's not O(N^2), but I fret a little about

    char *myptr2 = new char[len];

I see no delete?
 
Athletico
Posts: 14
Joined: January 7th, 2002, 4:17 pm

Big and Nasty 1gb .txt file - Need some advice

October 16th, 2007, 11:39 pm

>> I'm saving the unique tickers as a vector<string> and it stops at count=250, which is not much.

I completely agree w/DCFC on the std::set recommendation. But are you keeping *all* data in memory somehow as you parse for unique ticker symbols? I wouldn't expect that to be necessary. Memory should absolutely not be an issue w/250-element vectors. Check TaskMgr next time you run the app; your memory consumption will be obvious at a glance. The process grinds to a halt when the OS is forced to start paging to disk.
Last edited by Athletico on October 16th, 2007, 10:00 pm, edited 1 time in total.
 
afoster
Posts: 5
Joined: July 14th, 2002, 3:00 am

Big and Nasty 1gb .txt file - Need some advice

October 17th, 2007, 6:16 am

Why not just import the whole file into a table in a database (Postgres, for example)? This will be quick with a bulk insert (COPY in pg), and then you can just use SQL to extract/manipulate the data easily.
 
msperlin
Topic Author
Posts: 5
Joined: July 10th, 2006, 6:21 pm

Big and Nasty 1gb .txt file - Need some advice

October 17th, 2007, 7:57 am

Quote:
"OK, so it's not O(N^2), but I fret a little about char *myptr2 = new char[len]; I see no delete?"

About the N^2, I see what you mean (looping the uniqueness test over the unique tickers found up to line i). I'll implement a better solution soon (only test for uniqueness when ticker_i != ticker_i-1, in a nested while() framework). Yes, I'll be using delete now. From what I've googled, that's probably the cause of the error. Thanks. I'll test it in a few hours from now and will be back with results.

Quote:
"I completely agree w/DCFC on the std::set recommendation. But are you keeping *all* data in memory somehow as you parse for unique ticker symbols? I wouldn't expect that to be necessary. Memory should absolutely not be an issue w/250-element vectors. Check TaskMgr next time you run the app; your memory consumption will be obvious at a glance. The process grinds to a halt when the OS is forced to start paging to disk."

I'm sure I'm not saving anything beyond what I want (the unique tickers). Again, I think the problem was with the new/delete. About Task Manager, I'll do that. Thanks.

Quote:
"Why not just import the whole file into a table in a database (Postgres, for example)? This will be quick with a bulk insert (COPY in pg), and then you can just use SQL to extract/manipulate the data easily."

That's not exactly what I intended, but I'll try that too.
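
(A rough sketch of that shortcut on top of the std::set suggestion; the file name is a placeholder, and the benefit assumes lines for the same ticker tend to be consecutive. The result is correct either way, since the set still de-duplicates.)

    // Sketch: skip the set lookup when the ticker repeats the previous line's ticker.
    #include <fstream>
    #include <set>
    #include <string>

    int main()
    {
        std::ifstream ifs("trades_1gb.txt");   // hypothetical file name
        std::set<std::string> tickers;
        std::string line, prev;
        while (std::getline(ifs, line))
        {
            std::string ticker = line.substr(0, line.find(';'));
            if (ticker != prev)                // only touch the set when the ticker changes
            {
                tickers.insert(ticker);
                prev = ticker;
            }
        }
        return 0;
    }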
 
INFIDEL
Posts: 0
Joined: November 29th, 2005, 4:17 pm

Big and Nasty 1gb .txt file - Need some advice

October 17th, 2007, 9:30 am

Quote:

    int countnr = 0; // number of rows (lines)
    int countnc = 1; // number of columns
    ...
    nrow = countnr;
    ncol = countnc;

Depending on what you do with nrow and ncol afterwards, declaring your counters as ints might be causing problems if the size of an int on your machine is too small. Given your input file size, personally I'd go for a long int for countnr.
 
zeta
Posts: 26
Joined: September 27th, 2005, 3:25 pm
Location: Houston, TX

Big and Nasty 1gb .txt file - Need some advice

October 17th, 2007, 11:46 am

Quote:
"Hi, I'm extracting data from 1 GB files (8 million lines) where the information is intraday trading data (bid, ask, volume, etc. - 16 columns) for a lot of assets and lots of days. This is how I've done it so far: get what I want (a ticker symbol) with a C++ program by copying the desired parts (lines) to a txt file, then load the txt into matlab and process it (delete invalid entries, turn into doubles, etc.). This works, but the first step takes 20 minutes. Is there any way I can speed this up? Thanks."

For large datasets I would recommend using the hdf5 C API. I've used it successfully on quantum chemistry datasets > 100M, and as you probably know hdf5 is supported under octave/matlab.
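
(A minimal sketch of writing a block of doubles with the 1.6-era HDF5 C API, current at the time; the file name, dataset name, and dimensions are made up, and error checking is omitted.)

    // Sketch: write an nrow x ncol block of doubles to an HDF5 file (1.6-style C API).
    #include <hdf5.h>
    #include <vector>

    int main()
    {
        const hsize_t dims[2] = {8, 16};               // illustrative: 8 rows x 16 columns
        std::vector<double> data(8 * 16, 0.0);         // would hold the parsed numbers

        hid_t file  = H5Fcreate("ticks.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
        hid_t space = H5Screate_simple(2, dims, NULL);
        hid_t dset  = H5Dcreate(file, "/quotes", H5T_NATIVE_DOUBLE, space, H5P_DEFAULT);
        H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, &data[0]);
        H5Dclose(dset);
        H5Sclose(space);
        H5Fclose(file);
        return 0;
    }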
 
martinlukerbrown
Posts: 0
Joined: July 3rd, 2006, 7:51 pm

Big and Nasty 1gb .txt file - Need some advice

October 17th, 2007, 12:49 pm

I second that; it's really like doing a binary read, using HDF or netCDF. However, given the I/O overhead, it also makes sense to compress the file (using zip, gz, etc.) and then uncompress it in memory as you read it in. I've seen that make the loading of files an order of magnitude quicker. This is a technique that some database loggers (like the DB2 logger in MQSeries) use to speed up access; less I/O is good, and CPUs are very fast these days.

M
Last edited by martinlukerbrown on October 16th, 2007, 10:00 pm, edited 1 time in total.
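
(A rough sketch of reading a gzip-compressed file line by line with zlib's gz* functions; the file name is a placeholder.)

    // Sketch: read a gzip-compressed text file line by line, decompressing on the fly.
    #include <cstdio>
    #include <zlib.h>

    int main()
    {
        gzFile gz = gzopen("trades_1gb.txt.gz", "rb");      // hypothetical compressed input
        if (!gz) { std::printf("could not open file\n"); return 1; }

        char buf[4096];
        long lines = 0;
        while (gzgets(gz, buf, (int)sizeof(buf)) != NULL)   // decompresses as it reads
        {
            ++lines;                                        // parse the line here instead
        }
        gzclose(gz);
        std::printf("read %ld lines\n", lines);
        return 0;
    }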
 
afoster
Posts: 5
Joined: July 14th, 2002, 3:00 am

Big and Nasty 1gb .txt file - Need some advice

October 17th, 2007, 1:01 pm

I guess we all routinely work with large amounts of data. I tried HDF5, but dropped it as I found a tuned relational-database-based warehouse to be more flexible than the HDF5 file. Performance-wise, with indexed columns etc., accessing records from among millions is sub-second on my lowly workstation, using postgres. If I need to manipulate a large dataset, then I tend to use memcached to hold it in memory, and just run my programs across it. Simple and ludicrously fast.
 
martinlukerbrown
Posts: 0
Joined: July 3rd, 2006, 7:51 pm

Big and Nasty 1gb .txt file - Need some advice

October 17th, 2007, 1:25 pm

What's memcached, just memory caching or some product?

M
Last edited by martinlukerbrown on October 16th, 2007, 10:00 pm, edited 1 time in total.
 
afoster
Posts: 5
Joined: July 14th, 2002, 3:00 am

Big and Nasty 1gb .txt file - Need some advice

October 17th, 2007, 1:40 pm

Memcached is a distributed object caching system, used widely for internet sites. It's got APIs for all the major languages, and makes it dead simple to just stick an object (your data) in it with some kind of hash key, and then it just stays there, alive in memory, until you clear it. Great for really fast access to data.