algo-trading setup

pentagram · January 1st, 2011, 2:37 pm

I'm trying to design and implement an ATS (for home use, not an HF super-techno black box) and I would appreciate any advice on design issues. I don't want to fully implement something only to find out I was working in the wrong direction. Reading previous threads on wilmott.com was very helpful and I hope the setup is reasonable.

Essentially the whole system will be broken into two boxes: 1) a database & analytics box and 2) an order execution box.

The order execution box will talk to the broker, work out how the orders are best placed, and receive the orders themselves from the database & analytics box via XML-RPC.

The database & analytics box will have the following components:

a) a local non-relational DB, probably Berkeley DB, maybe HDF/PyTables, which will store historical data and will also fetch these data from a data provider, e.g. eSignal (I want all storage to be on the same box as the analytics engine in order to have faster processing and reduce costs)
b) the analytics engine, which will process the DB and e.g. find pairs
c) a portfolio engine that keeps track of my position and gets input from the analytics engine. This engine will send signals for when to enter a position, when to exit and how to manage risk for each trade (e.g. stops)

Since the analytics engine will be doing some heavy number crunching, I don't want it to eat resources from the order execution system, and since the order data shouldn't be too big, I assume they can be sent over a local network.

Is the two-box setup and the task-partitioning design reasonable? Am I better off using one box? Are there any scaling issues I haven't even thought about? Any additional advice?

Thanks in advance

AKalmykov · January 1st, 2011, 4:09 pm

The main rule of thumb is: *don't* optimize until you really have to. Don't add a second box until you see that one box is not enough. Just use sockets for communication between the analytics and execution engines, so it will be easy to move to two boxes if needed.

Non-relational DB? Are you sure that you really want this (i.e. do you clearly understand the benefits and drawbacks)?
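A minimal sketch of the socket-based split suggested above, using the XML-RPC transport mentioned in the opening post (Python standard library only). The `place_order` handler, its parameters, and the ticker "XYZ" are hypothetical names made up for illustration; nothing here talks to a real broker. Because both sides only see a host and a port, the same code should work whether the two engines share one box or run on two.

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer

# Hypothetical in-memory order log kept by the execution engine.
received_orders = []


def place_order(symbol, side, quantity):
    """Hypothetical execution-side handler: record the order and acknowledge it."""
    order = {"symbol": symbol, "side": side, "quantity": quantity}
    received_orders.append(order)
    return {"status": "accepted", "order_id": len(received_orders)}


def start_execution_server(host="127.0.0.1", port=0):
    """Start the execution engine's RPC endpoint (port=0 lets the OS pick a free port)."""
    server = SimpleXMLRPCServer((host, port), logRequests=False, allow_none=True)
    server.register_function(place_order)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


# Execution engine side (could live on a second box later).
server = start_execution_server()
host, port = server.server_address

# Analytics / portfolio engine side: only needs to know the host and port.
with ServerProxy(f"http://{host}:{port}") as execution:
    ack = execution.place_order("XYZ", "buy", 100)

print(ack)

server.shutdown()
server.server_close()
```

Moving to two boxes would then only mean binding the server to a LAN address and pointing the client at it; authentication and error handling are deliberately left out of this sketch.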

pentagram · January 1st, 2011, 4:15 pm

Thanks for your reply. I'm not a DB expert (though I have done some formal training in relational DBs and worked with DBs in the past). The reason I chose to work with a non-relational DB is that I don't see any advantage in having a relational DB for storing time series. The reason I don't plan to use a remote DB, but to have it all on the same machine as the analytics engine, is that an analytics engine which fetches whole GBs from a DB will be faster if the DB is on the same drive than if it has to go through a network DB (cheaper as well).

Edit: btw this is exactly why I made this topic. Any criticism of this setup is more than welcome (in fact it is asked for!), and the harder the criticism, the better. I'd much rather have somebody point out flaws now than find them myself in 8 months, after hours of coding.

AKalmykov · January 2nd, 2011, 2:53 pm

Quote: The reason I chose to work with non-relational DB is that I don't see any advantage in having a relational DB for storing time series.

But what are the advantages of a non-relational DB?

Quote: DB will be faster if the DB is on the same drive instead of using a network db (cheaper as well).

Cheaper - yes; faster - (should be) no (compare 6 Gbit/s SATA vs 100 Gigabit Ethernet).

In my opinion you should use a non-relational DB only if at least one of the following points is true:

1) you have a terrific amount of data (>>10 TB)
2) your data is essentially "schema-less"
3) your whole database consists of several huge tables and you don't need to run any complex queries across several tables (e.g. joins)
4) you need to execute heavy, full-scan queries (like "full text search")

I think none of the above holds for your application. You can run a relational DB locally or remotely. It doesn't matter.

Also note that your expertise is very important when you choose your DB platform. Stick to something you have experience with, or at least something with a good knowledge base. You can easily find a good book (or find some expert) on how to work with and optimize Oracle or MySQL. I doubt that such a reference exists for non-relational DBs (but I could be wrong).

Good luck.

Disclaimer: I'm not a non-relational DB expert, but I'm quite proficient in old-school relational databases. So I'm biased.
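To put rough numbers on the SATA vs Ethernet comparison above, here is a small sketch that converts raw link rates into best-case transfer times per gigabyte. The 6 Gbit/s and 100 Gbit/s figures come from the post; the 1 Gbit/s entry is an added assumption about what a typical home LAN might offer. These are line-rate upper bounds only: they ignore protocol overhead and the sustained throughput of the disks themselves, which is often the real limit.

```python
def transfer_seconds(gigabytes, link_gbit_per_s):
    """Best-case time to move `gigabytes` (decimal GB) over a link with the given raw rate."""
    return gigabytes * 8 / link_gbit_per_s


links = {
    "SATA 6 Gbit/s": 6,
    "100 Gigabit Ethernet": 100,
    "1 Gigabit Ethernet (assumed home LAN)": 1,
}

for name, rate in links.items():
    print(f"{name}: {transfer_seconds(1, rate):.2f} s per GB")
```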

pentagram · January 2nd, 2011, 5:14 pm

At first I thought of making my own binary format for the time series, but I want to reuse as much existing code as possible, so I thought of using a non-relational DB. My (limited) understanding is that non-relational DBs essentially do that, but I stand to be corrected; I'm not an expert in DBs.

Regarding the data size, the initial data won't be that big. However, if I can implement a setup that scales with size, I will choose to do that. I guess that people who store tick data, e.g. from eSignal, do have tick DBs of 10 TB. Some years of tick data (for a couple of industry groups) could reach that(?). Again, I'm not 100% sure how to do this(*), hence this post. If someone can share how they store their data (for backtesting purposes), I'll be very happy to hear how they worked out this part of their system.

(*) E.g. I'm not aware of typical sizes for tick-data DBs if you want to trade two stock exchanges, or of the typical size for FX DBs, commodities DBs etc. How many GB per day do such DBs typically grow? Does it matter much whether a DB supports transactions or not? 10 TB is super big for non-institutional setups today (in 3 years it may be OK). Do people use storage devices from e.g. Dell for this sort of job, or is RAIDing "normal" 2 TB HDDs OK?

The points you make about documentation are very much to the point, and some existing relational DBs have a huge user base and are unlikely to disappear in a couple of years. I have no clue where NoSQL will be 5 years from now, but I'm pretty confident that PostgreSQL and MySQL will still be around in one form or another. I worked with MySQL in '02-'03, so while it has probably changed drastically, possibly some things I remember are still there.

To make what I'm trying to do clearer: the purpose of this DB will be to hold data for backtesting. It will store time series of tick data, and the backtesting platform will fetch these to check e.g. for pairs trading. If there are other aspects of a historical data DB (e.g. queries other than fetching a whole time series from date1 to date2) that I should be considering, please do let me know!
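For reference, the "fetch a whole time series from date1 to date2" query discussed above is straightforward in a relational store. The sketch below uses SQLite from the Python standard library purely as a stand-in; the table name `ticks`, its columns, the index, and the sample rows are all made up for illustration and are not a recommended production schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory DB, for illustration only
conn.execute(
    """
    CREATE TABLE ticks (
        symbol TEXT NOT NULL,
        ts     TEXT NOT NULL,    -- ISO-8601 timestamp; sorts chronologically as text
        price  REAL NOT NULL,
        size   INTEGER NOT NULL
    )
    """
)
# A composite index lets the range query below avoid scanning the whole table.
conn.execute("CREATE INDEX idx_ticks_symbol_ts ON ticks (symbol, ts)")

# Made-up sample data for two hypothetical symbols.
sample_ticks = [
    ("AAA", "2011-01-03T09:30:00.000", 10.00, 100),
    ("AAA", "2011-01-03T09:30:01.250", 10.01, 200),
    ("AAA", "2011-01-04T09:30:00.000", 10.05, 150),
    ("BBB", "2011-01-03T09:30:00.500", 20.00, 300),
]
conn.executemany("INSERT INTO ticks VALUES (?, ?, ?, ?)", sample_ticks)


def fetch_series(conn, symbol, start, end):
    """Return (ts, price, size) rows for `symbol` with start <= ts < end, oldest first."""
    cursor = conn.execute(
        "SELECT ts, price, size FROM ticks "
        "WHERE symbol = ? AND ts >= ? AND ts < ? ORDER BY ts",
        (symbol, start, end),
    )
    return cursor.fetchall()


# All AAA ticks on 2011-01-03 (the end bound is exclusive).
series = fetch_series(conn, "AAA", "2011-01-03", "2011-01-04")
for row in series:
    print(row)
```

Whether this scales to years of tick data is a separate question; the point is only that a date-range fetch does not by itself call for a non-relational store.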

Stale · January 3rd, 2011, 10:59 am

Hi,

How will you keep track of the executed trades? Other system parameters? I think you will end up using some kind of database anyway, so why not go for SQL? Also, which language will you use to implement this? ORMs are quite neat, for implementation speed at least.

Stale

winstontj · January 3rd, 2011, 3:12 pm

Pentagram,

How will you execute? Have you looked at software like TradeLink? http://code.google.com/p/tradelink/ It's free and open-source and integrates with many platforms. It also has a pretty good backtest/simulation module that can be easily modified.

For what it's worth, if you are running a home system you should be able to do this all on one simple Intel Q9650, i5-650 or Xeon X5450. No need to have a crazy box or multiple boxes - all of this stuff is very small and fairly lightweight in terms of system resources. You are over-engineering this... it's much easier than you think.

pentagram · January 3rd, 2011, 4:24 pm

Many thanks for pointing to TradeLink. I can see that it has an execution management system, support for backtesting, connections to data sources, charting and viewing of tick files. Under "ASP" it also says it supports storing tick data. In what format does it store the data? Also, what "typical" storage capacities are needed to store tick files? Is 1-2 TB good enough?

pentagram · January 3rd, 2011, 4:26 pm

Quote, originally posted by Stale: Hi, How will you keep track of the executed trades? Other system parameters? I think you will end up using some kind of database anyway, so why not go for SQL? Also, which language will you use to implement this? ORMs are quite neat, for implementation speed at least. Stale

Trade logs are different though; I don't need to store them in the same place as tick data.

tradelink · January 3rd, 2011, 4:49 pm

pentagram,

TradeLink stores tick data in a binary format that allows for the fastest possible playback (250,000-800,000 ticks per second):
http://code.google.com/p/tradelink/wiki/SpeedTests

Just like everything in TradeLink, this format is open source:
http://code.google.com/p/tradelink/sour ... ants.cs

For level 1 equities data, average storage requirements are between 2 and 10 MB per equity symbol per day. Other security types tick less, so they should require less storage. If you store on a compressed drive you should get 10-50 times compression.

Here are other supported sources of data in TradeLink beyond live tick capture:
http://code.google.com/p/tradelink/wiki/DataSupport
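As a rough check on the "is 1-2 TB enough?" question, the per-symbol figures above can be turned into a back-of-the-envelope estimate. The 2-10 MB per symbol per day range and the 10-50x compression range come from the post; the universe of 500 symbols, 252 trading days per year and a 3-year history are assumptions picked only for illustration.

```python
MB = 10**6   # decimal megabyte, in bytes
TB = 10**12  # decimal terabyte, in bytes


def estimate_storage_bytes(symbols, trading_days, mb_per_symbol_day, compression_ratio=1.0):
    """Estimate tick storage in bytes; compression_ratio=10 means 10x smaller on disk."""
    raw_bytes = symbols * trading_days * mb_per_symbol_day * MB
    return raw_bytes / compression_ratio


# Assumed universe and history length (hypothetical, not from the thread).
symbols = 500
trading_days = 252 * 3

low = estimate_storage_bytes(symbols, trading_days, 2)
high = estimate_storage_bytes(symbols, trading_days, 10)
print(f"uncompressed: {low / TB:.2f} to {high / TB:.2f} TB")

# Least favourable end of the quoted 10-50x compression range.
low_c = estimate_storage_bytes(symbols, trading_days, 2, compression_ratio=10)
high_c = estimate_storage_bytes(symbols, trading_days, 10, compression_ratio=10)
print(f"with 10x compression: {low_c / TB:.3f} to {high_c / TB:.3f} TB")
```

Under these assumptions the uncompressed total lands somewhere between roughly three quarters of a terabyte and a few terabytes, so the answer depends heavily on how many symbols are captured and whether compression is used.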

vincegata · January 5th, 2011, 5:09 pm

@tradelink

1) I see you use C#. Is all of your code in C#? Would it be difficult to port your C# code to C++ (given I am good at both)? I just think that an ATS should be implemented in C++ for speed, and on a Linux/UNIX platform for speed and stability. It seems like that's the combination the big players use for production.

2) Could you explain how to download the source code? I see this under the source code tab, but could you elaborate or point to more detailed instructions?

Use this command to anonymously check out the latest project source code:

# Non-members may check out a read-only working copy anonymously over HTTP.
svn checkout http://tradelink.googlecode.com/svn/trunk/ tradelink-read-only

Thanks,

Hansi · January 5th, 2011, 5:44 pm

Quote, originally posted by vincegata: @tradelink 1) I see you use C#, is all of your code in C#? Would it be difficult to port your C# code to C++ (given I am good at both)? I just think that an ATS should be implemented in C++ for speed, and on Linux/UNIX platform for speed and stability. Seems like that's the combination the big players use for the production.

Porting whole projects from C# to vanilla C++ is a pain, because the C# code is generally built around .NET and replacing those components is a hefty chore. Calling them from C++/CLI is, I would assume, going to give you pretty much the same results speed-wise, because managed C++/CLI goes through the CLR just like the C# code. For a one-man show I think TradeLink is fast enough on its own, and if you want to run it under Unix then Mono should provide compatibility for most if not all of the stuff used in TradeLink. (@tradelink can give a better answer on whether or not it's possible, since my view is only based on a quick glance at the code.)

Quote, originally posted by vincegata: 2) Could you explain how to download the source code. I see this under the source code tab, but could you elaborate or point to a more detailed instructions. Use this command to anonymously check out the latest project source code: # Non-members may check out a read-only working copy anonymously over HTTP. svn checkout http://tradelink.googlecode.com/svn/trunk/ tradelink-read-only Thanks,

That command will check out a copy of the source code from the Subversion server. So basically just install SVN, cd to the directory you'd like to be the parent directory of the source code folder, and then run that command.

vincegata · January 5th, 2011, 10:09 pm

Thank you

tradelink · January 5th, 2011, 10:32 pm

Hansi is correct both in his SVN instructions and in saying that porting to C++ is only going to reap measurable returns in very specific cases.

.NET is not an interpreted language, so unless you are using reflection or a very slow part of .NET, you're not going to see much difference between compiled C++ and CLI bytecode (although this is a large and complex topic, so I don't want to start a war over my superficial treatment of this point, lol).

That said, about 30% of TradeLink is in C++ and most of the core objects are already duplicated in C++, so you can write native C++ applications in addition to managed C++ through .NET. We don't generally see speed improvements from doing this (in some cases it has been slightly slower) and you give up features, so use it with caution.

Regarding Mono: yes, it should be possible to run the .NET components in TradeLink on Mono. To my knowledge no one is doing this at present, but we would welcome making this easier and would help anybody who wanted to do so. Some of the broker connections require 3rd-party libraries which will only run on Windows platforms though (e.g. COM interfaces), so you should be aware of this. TradeLink does support TCP/IP as a transport protocol, which would allow you to run these particular connectors on a Windows machine and connect to them from other platforms. It does this at 40,000 ticks a second, which represents an average latency of .05 milliseconds, aka 50 microseconds.