October 11th, 2011, 11:47 am
Anyone have any good articles, book recommendations or personal experience for building an effective database structure for an equity fundamentals database, for historical backtesting, stock screening etc.?

Note that a single variable is about 45,000 symbols by 350 months, and there are hundreds of variables. Data is pulled into R, where each variable is used as a matrix with symbols as rows and dates as columns.

I tried going with these setups in Oracle and Sybase:

A)
- Fields: Symbol, Variable, Date, Value
- Separate and clustered indices on all fields but Value.
- Data needs to be "melted"/pivoted to an m x n matrix for R (crazy memory inefficient) -- roughly the step sketched at the end of this post.
- Average query for a normal sample into R: 4-8 minutes

B)
- Each variable in a separate table.
- Fields: Symbol, Date, Value
- Separate and clustered indices on all fields but Value.
- Views added to cache common subsets (dunno if it helped at all...)
- Data needs to be "melted"/pivoted to an m x n matrix for R (crazy memory inefficient)
- Average query for a normal sample into R: 3-5 minutes

C) [Should maybe have tried a column-based database here]
- Symbols and dates stored separately and mapped to row and column numbers only.
- Each variable in a separate table, with symbols as rows and dates as columns.
- Really bad for how the data maps onto disk.
- Data is already in the correct format for R.
- Average query for a normal sample into R: 1-3 minutes

All of these are too slow, so I decided to go with gzipped binary RData workspaces containing the variables instead, which take about 5 seconds to load locally and 20 seconds over the network (sketch at the end of the post).

Is there anything I can do to make the database route come anywhere close to the binary file speeds?

Any suggestions welcome.
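For reference, the melt/pivot step in setups A and B is roughly the following on my end. This is just a minimal sketch assuming the query result arrives as a long-format data.frame df with columns symbol, date and value for a single variable; the column names are for illustration, not the actual schema.

    ## Long-format query result -> symbols x dates matrix for one variable.
    ## Assumes df has columns symbol, date, value (names illustrative only).
    syms  <- sort(unique(df$symbol))
    dates <- sort(unique(df$date))

    m <- matrix(NA_real_,
                nrow = length(syms), ncol = length(dates),
                dimnames = list(syms, as.character(dates)))

    ## Fill by (row, col) index pairs rather than reshaping the whole frame;
    ## each observation is written straight into its cell.
    m[cbind(match(df$symbol, syms), match(df$date, dates))] <- df$value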
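And the binary route I ended up with is basically just save()/load() of the matrices into a gzipped workspace (variable names below are made up):

    ## Write the variable matrices to a gzipped binary workspace.
    ## 'earnings' and 'book_value' stand in for the real variable names.
    save(earnings, book_value,
         file = "fundamentals.RData", compress = "gzip")

    ## Later, locally or over the network share:
    load("fundamentals.RData")   # restores the matrices into the workspace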