You should start with some basic statistics: count how many changes per second you get for each tracked instrument.
It's easy using NinjaTrader's OnMarketDepth(), for example, then plot the results.
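Something along these lines would do it. This is a minimal sketch meant to sit inside a NinjaTrader 7 indicator class (the OnMarketDepth() signature changed slightly in NT8), with the plotting part left out:

private long currentBucket = -1; // the wall-clock second currently being counted
private int  updateCount;        // depth updates seen in that second

protected override void OnMarketDepth(MarketDepthEventArgs e)
{
    // Bucket every level 2 update by wall-clock second
    long bucket = DateTime.Now.Ticks / TimeSpan.TicksPerSecond;

    if (bucket != currentBucket)
    {
        // A new second started; report the completed one
        if (currentBucket != -1)
            Print(Instrument.FullName + ": " + updateCount + " depth updates/sec");
        currentBucket = bucket;
        updateCount   = 0;
    }

    updateCount++;
}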
I'm afraid a single server won't be able to collect level 2 updates for 500 symbols, and that's before writing the data anywhere.
Nanex does this, and NinjaTrader does too for their Market Replay, but not on a single server, I think.
I just did a quick check on ES 03-16 about 15 minutes after the open today: roughly 2,000 market depth updates/second at the peak, with the average around 300/second.
Okay, it's a liquid instrument, but if @treydog999 wants to deal with 500 symbols, this kind of frequency is going to be tricky to manage: even at ES-like average rates, 500 symbols at 300 updates/second works out to roughly 150,000 updates/second to process and store.
That does seem like a lot. I appreciate you taking a look at that for me. What if I did not need to collect the data myself? For example, I could buy it from a tick warehouse / professional data provider and just use this database to store the data after the close and run queries against it. That may be a better alternative for me, as I definitely would not be able to handle 500 instruments by collecting them myself, maybe not even 100.
I think you should build your software solution first, then scale up as needed. Don't worry about hardware needs for now.
So, get data for, say, 10 symbols. Write the apps or queries that are going to read and write this data. Use it for a while and work out the bugs. Keep adding symbols until your existing hardware starts to struggle; that will give you a good idea of how high-end your hardware solution needs to be.
In general, my philosophies towards hardware are:
- SSDs always, at least for anything where response time matters. You can use HDD for archive and backups. You can get a 2TB SSD for $600+ now.
- The more RAM the better
- RAID 10 is optimal; at minimum, do RAID 0 for speed and back up transaction logs to a separate drive. You only need to step up to RAID 1+0 if you can never afford to lose a single transaction in your data stream. If you can re-load the data later from an outside source, you don't need the RAID 1 mirroring: just restore from a transaction log backup and reload anything after your last backup from your outside source (see the sketch after this list).
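To make the transaction-log idea concrete, here is a hedged sketch assuming SQL Server and ADO.NET; the "TickData" database name and the backup path (on a separate physical drive) are made up:

using System.Data.SqlClient;

class LogBackupJob
{
    // BACKUP LOG captures only the log records written since the last log
    // backup, so running this on a schedule keeps the restorable window small.
    public static void Run(string connectionString)
    {
        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(
            @"BACKUP LOG TickData TO DISK = N'E:\LogBackups\TickData_log.trn'", conn))
        {
            conn.Open();
            cmd.CommandTimeout = 0; // log backups can run long; don't time out
            cmd.ExecuteNonQuery();
        }
    }
}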
Initially, I'm guessing you want to play with and research this set of data and come up with some algos based on the historical data, so the always-on, real-time aspect of the server is not crucial as long as you can reload it later from an outside source. Once you move to trading live money with it, real-time uptime becomes crucial and you'll need to implement the highest degree of speed and availability.
And I don't think MongoDB or any other NoSQL solution is right for you. Those are more appropriate for large volumes of unstructured data (video, documents, etc.); if you're dealing with a deterministic structure of numbers, SQL will be superior for you both storage-wise and application-wise (queryability). NoSQL does have superior horizontal scaling, but poor support for ACID transactions. For more, see this SQL vs. NoSQL article.
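To illustrate the "deterministic structure of numbers" point, here is a hedged sketch assuming SQL Server through ADO.NET; the Ticks table and its columns are hypothetical:

using System;
using System.Data.SqlClient;

class TickStore
{
    // Every row has the same few strongly-typed numeric columns, which is
    // exactly what relational storage, indexing, and SQL queries are built
    // for. Assumed table:
    //
    //   CREATE TABLE Ticks (Symbol CHAR(12), TickTime DATETIME2,
    //                       Price DECIMAL(18,8), Size INT);

    public static void Insert(SqlConnection conn, string symbol,
                              DateTime time, decimal price, int size)
    {
        using (var cmd = new SqlCommand(
            "INSERT INTO Ticks (Symbol, TickTime, Price, Size) " +
            "VALUES (@sym, @time, @price, @size)", conn))
        {
            cmd.Parameters.AddWithValue("@sym", symbol);
            cmd.Parameters.AddWithValue("@time", time);
            cmd.Parameters.AddWithValue("@price", price);
            cmd.Parameters.AddWithValue("@size", size);
            cmd.ExecuteNonQuery();
        }
    }
}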
Thanks for the comments. Have you heard of KERF? Just curious, as it's a new database for time series data. It's not well known or battle-tested, but since you seem knowledgeable, I thought I'd get your opinion.
In regards to using this for development and research, you are bang on. Our production servers use a totally different system that is an outsourced solution. I am doing some basic research to see what can be brought in-house.
The negative to that is that most tick data suppliers, for the large quantities I need, usually do a one-time data dump and then daily FTP uploads for updates. I would not be able to get that data back just by going back to my original source. For example, Bloomberg uses this methodology, and that initial data dump is not cheap; losing it would be disastrous, a 100x multiple of the cost of my test server. So RAID 1+0 is probably the way to go for me. Thoughts on RAID 5?
Store both the raw files from the FTP dump and the database derived from the raw files. They can just be in two different directories in the same file system.
If you want a cheap and quick backup of those raw FTP files on top of the copy that resides on your database server, since they are static you can probably do it very cheaply with Glacier, or, failing that, a plug-and-play NAS or an external HDD.