Page 1 of 2
Equity fundamentals database design
Posted: October 11th, 2011, 11:47 am
by Hansi
Anyone have any good articles, book recommendations or personal experience for building an effective database structure for an equity fundamentals database for historical backtesting and stock screening etc?Note a single variable is about 45000 symbols, 350 months. There are hundreds of variables.Data is taken into R where the variables are used as matrices with symbols as rows and dates as columns.I tried going with these set ups in Oracle and Sybase:A)- Fields: Symbol, Variable, Date, Value- Seperate and clustered indices for all but value.- Data needs to be "melted"/pivoted to a mxn matrix for R (crazy memory inefficient)- Average query for a normal sample into R: 4-8 minutesB)- Each variable in a sperate table.- Fields: Symbol, Date, Value- Seperate and clustered indices for all but value.- Views added to cache common subsets (dunno if it helped at all...)- Data needs to be "melted"/pivoted to a mxn matrix for R (crazy memory inefficient)- Average query for a normal sample into R: 3-5 minutesC) [Should maybe have tried a column based database here]- Symbols and dates stored seperately and map to row and col numbers only- Each variable in a seperate table with symbols for rows and dates for columns- Really bad for where data maps on disk.- Data already in correct format for RAverage query for a normal sample into R: 1-3 minutesAll of these are too slow so decided to go with gzipped binary RData workspaces containing the variables which is about 5 sec to load locally and 20 sec over the network.Is there anything I can do to make the database route come anywhere close to the binary file speeds?Any suggestions welcome.
Equity fundamentals database design
Posted: October 11th, 2011, 12:25 pm
by SierpinskyJanitor
BTW Hansi, you seem to be focusing on RDBMs here, is that a constraint, or would you consider something objet-based instead? In fact, the best business DBs in Equity Derivs I have worked for were built under this paradigm.
Equity fundamentals database design
Posted: October 11th, 2011, 1:18 pm
by Hansi
The basic options are MSSQL, Oracle and Sybase. That is only RDBMs. I could go for something else if I can justify it by being super fast.Any suggestions? Maybe one of the tabular nosql ones? An object would then be an variable? How does that scale in terms of additional symbols + dates? Again I iterate, databases are not my forte.
Equity fundamentals database design
Posted: October 11th, 2011, 1:49 pm
by SierpinskyJanitor
Hansi, apologies, but are you posting questions about a specific problem you're tackling in a REAL production environment? If this is the case, and if this is business sensitive, you must at least tell your manager that you're leveraging from this platform. Otherwise this is borderline unethical.
Equity fundamentals database design
Posted: October 11th, 2011, 2:13 pm
by Hansi
I mis-understood.
Equity fundamentals database design
Posted: October 11th, 2011, 2:17 pm
by SierpinskyJanitor
no Sir, you have in fact misunderstood my reply:It is not ethical to post PRODUCTION related issues in open forums, regardless.
Equity fundamentals database design
Posted: October 11th, 2011, 2:45 pm
by Hansi
You are right, too much context was given. I edited the text above to focus only on the issue at hand.
Equity fundamentals database design
Posted: October 11th, 2011, 2:55 pm
by SierpinskyJanitor
Thanks Hansi, one is never too careful with these things. Sometimes not even our managers are fully aware of these operational constraints and fall carelessly within audit´s radar scans... nasty consequences ( personal experience here ). By raising this issue and reediting we are actually setting standards and giving a good example since the whole forum is jeopardised in case of trouble. OK let´s refocus now.
Equity fundamentals database design
Posted: October 11th, 2011, 11:03 pm
by rmax
Without seeing the schema it is going to be tricky to give good advice, but here goes:It sounds as though you have performance issues on the data extract rather than the data load. Hence would suggest that you look at your query plan and see where it is sub optimal. On this you want to put a clustered index. On the other fields I would put a non-clusted (focused on what you are trying to extract).Variables in a separate table sounds very sub-optimal, as you are going to have to at least do mutliple index-seeks+hash joins to pull relevant data together.This is the tip of the iceberg for an RDBMS. If you want anything more then I would need to know specific details.
Equity fundamentals database design
Posted: October 12th, 2011, 7:33 am
by assylias
This is more of a hack than good DB design but I have once seen a DB where price tables had 366 columns containing the data for each day of a year. It is quite ugly but can be much faster than having a typical 2 columns [date, value] structure.So it looks like:[year, value on day 1, value on day 2, value on day 3, .... , value on day 366]It requires some cleaning when retrieving data to remove empty days.This could easily be transposed to monthly frequency, but the performance gain might be smaller.
Equity fundamentals database design
Posted: October 12th, 2011, 8:43 am
by Hansi
QuoteOriginally posted by: assyliasThis is more of a hack than good DB design but I have once seen a DB where price tables had 366 columns containing the data for each day of a year. It is quite ugly but can be much faster than having a typical 2 columns [date, value] structure.So it looks like:[year, value on day 1, value on day 2, value on day 3, .... , value on day 366]It requires some cleaning when retrieving data to remove empty days.This could easily be transposed to monthly frequency, but the performance gain might be smaller.That's similar to option C above. This is quite sub optimal because it scales badly as when you are adding in columns and rows they are all over the hdd so as it grows the seek time for the query grows really badly.
Equity fundamentals database design
Posted: October 12th, 2011, 8:57 am
by Rufus
You could look at:
http://kx.com/They have a 32-bit evaluation for download.
Equity fundamentals database design
Posted: October 12th, 2011, 9:06 am
by Hansi
QuoteOriginally posted by: rmaxThis is the tip of the iceberg for an RDBMS. If you want anything more then I would need to know specific details.It's more so about what method/schema would be optimal to store this in order for it to be as fast as possible to pull into R with the ability to select only specific symbols and months.It might make more sense with this example of 5 random variables. So basically it's 5 matrices that contain doubles in the below format that I want to store in a database and be able to query back out with a select number of symbols and months at a time to reduced sized matrices in R:
Equity fundamentals database design
Posted: October 12th, 2011, 9:11 am
by Hansi
QuoteOriginally posted by: RufusYou could look at:
http://kx.com/They have a 32-bit evaluation for download.If I try a column oriented approach is it better to set symbols or dates as columns? I'm assuming it's symbols?
Equity fundamentals database design
Posted: October 13th, 2011, 7:28 am
by rmax
Starting from scratch without any input I would place it in a table structure that is:DateSymboldatatypeValuePut a clustered index on Date,SymbolIndex on DatatypeSuggest you crack open the query plans and see where the performance hit is. Are you sure that it is the data seek, and not the driver from RDBMS into R?I think you should then be able to optimise by placing different tables on different disks to increase performance. Inherently I feel that Oracle/MS Query experts should have a better algorithm for finding data than something that is put into R.Would then be tempted to create a materalised view (or whatever they called are now) that breaks up the data into how you are going to query it. Should be a safe way to go as the data is the historic data is pretty static.Have never looked at the MapReduce algos, but that might be something you can look at as well. but a postage stamp would give me ample space to write down my knowlege of these.