Just a simple comparison: loading data (ticks or candles, it doesn't matter) from a local DB (even using named pipes, one of the fastest inter-process communication methods) is 100-200 times slower than reading it from a regular CSV file.
Market data is not a good fit for relational databases. First of all, there are no relations in it. Pricing data is like streaming video: it is oriented toward a raw data format (text or binary). The fastest approach is to load 1-2 weeks of tick data and keep it in memory.
Want to upload it to a cloud service? Pretty easy. We use AWS for iterative backtesting (optimization), allocating ~20 servers at the same time with replicated market data. The cheapest data storage is S3; we replicate it to EC2, as blobs of course. No SQL queries or anything like that to kill the performance.
A DB has great potential, but not with market data and backtesting.
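To make the "load a week or two of ticks into memory" approach concrete, here is a minimal Python sketch. The column layout (timestamp, price, size) is an assumption for illustration, not any platform's actual format:

```python
import csv
import io

def load_ticks(csv_text):
    """Parse tick rows into an in-memory list of tuples.

    Assumes a hypothetical 3-column layout: timestamp, price, size.
    Once loaded, a backtest iterates over plain Python objects with
    no per-row I/O or query overhead.
    """
    ticks = []
    reader = csv.reader(io.StringIO(csv_text))
    for ts, price, size in reader:
        ticks.append((float(ts), float(price), int(size)))
    return ticks

sample = "1.0,4100.25,3\n1.5,4100.50,1\n"
print(load_ticks(sample))  # -> [(1.0, 4100.25, 3), (1.5, 4100.5, 1)]
```

For a real file you would pass `open(path)` to `csv.reader` directly; the point is that the whole dataset ends up as plain in-memory rows before the backtest loop starts.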
There is some fantastic real-world experience coming out in this thread! Regardless of the approach, there is information in here for a wide array of users. Great thread guys.
Sorry for the extreme hypothetical, but I doubt Facebook can run their data collection/storage back-end off of CSV files.
Several individual traders in this thread have alluded to using DBs (Cassandra, Mongo, etc.), but if, in the context of financial market data, going flat-file CSV is faster, cheaper, and just as easy or easier to scale to some infinite horizon, why would anybody deploy a DB?
I'm going to be pulling unfiltered, streaming tick data for some of the most active futures instruments, such as Crude Oil, E-Mini S&P, etc. The number of rows will be in the tens of millions rather quickly. I get that reading from / writing to a CSV is a lot simpler and faster than a DB table, but when we're talking large-scale setups, it seems this argument won't be able to keep up with the reality of the demand.
For backtesting? Actually, I haven't seen any commercial trading app that keeps market data in a DB. NT, MC, SC, TS - they all use their own flat-file formats.
Sure. CSV, like any binary format, is just files. Files are not a silver bullet. I'm sure there are hundreds and thousands of cases where a DB works better than plain files, even in trading and backtesting. For my cases (running data locally or in cloud services), raw files are a better fit.
My point in this thread is to plant a seed of doubt. The terms "Data" and "Database" are not equivalent. The right choice depends on many, many details (storage type, number of users, distributed or parallel operation, read-heavy vs. write-heavy workloads, etc.).
A DBMS and a flat file are not all that different. Many DBMSes are just executables that sit on top of a flat file on your file system. You can always roll your own and achieve comparable functionality, but if you find yourself replicating most of the functionality of a DBMS, then chances are that years of painstaking B-tree optimization etc. are superior to your own solution.
The truth is, platform designers for NT, MC, SC, TS etc. are 20 years behind in their design choices because their users don't run into serious use cases. On the other hand, a company like FB has many flexible and unstructured use cases for their data that they have not yet discovered. The principle of 'big data' in a firm like FB is that they're collecting more data than they can churn analytics on for months to come, whereas most retail traders can backtest through their entire collection of data overnight. So you should let neither sway your decision on what to do.
I can name several use cases where a DBMS is superior:
1. Administrative privileges. This isn't just a matter of having multiple people in a company that you're working with. Even in a 1 man shop, it is poor taste for your backtesting application to have the same write privileges as your storing application.
2. Constraints. For example, let's say you're storing EUR/USD data for FXCM and then you decide to trade EUR/USD on Hotspot. Now maybe you want to rename them to EUR/USD.FXCM and EUR/USD.Hotspot and modify all the associated data (tables). It's easy to make this modification with a `CASCADE` operation, but troublesome if you rolled your own solution with flat files. Or let's say you rolled a software change to the application that is storing your tick data last Thursday at 5 PM ET, and after a backtest today, you realized that all your data since last Thursday 5 PM ET was glitched. Maybe all of it has incorrect rollover offsets. With careful planning in your schema and a few statements in your DBMS's DSL, you can fix this.
3. Indexing and range selection. Let's say you're trying to select data from 3 AM on Aug 15, 2014 to 4:15 PM on Aug 17, 2014 for a backtest. Walking through the array that backs the file system directory structure is cheap because you probably don't have a large number of files; however, at some point you're going to have to walk through the timestamps. It takes O(2 log n) to locate both endpoints by binary search, but many CSV parsing modules/libraries are agnostic to sort order and will brainlessly walk O(2n) through this. Because your handrolled solution is almost surely in row major order, you're paying as much as 2 days of backtesting in I/O cost just to qualify your data for an actual ~3-day backtest loop. Similarly, hash indices make short work of certain useful queries ("fetch me all ticks at support/resistance level"). It's ugly to serialize a B-tree or hash index blob in CSV or your own binary format to compete, and that's code that doesn't bring you joy or money.
4. One-stop shop for redundancy and backup. It's entirely possible to do backup with RAID and a bunch of Bash scripts (rsync etc.) but for many users, it's cleaner to take care of backup entirely at the DBMS application level. Newer DBMSes ensure you're backed up on commit and can fall back on your backup in a hardware-agnostic way. You could have 2 separate RAID0 machines and it would still work.
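The range-selection point above (item 3) can be sketched in a few lines of Python. This assumes the rows are already sorted by timestamp, which a hand-rolled flat-file store may or may not guarantee; a DBMS index gives you this for free:

```python
from bisect import bisect_left, bisect_right

# In a real store the sorted key column (or a B-tree over it) is
# maintained once at write time, so each range query costs only the
# two binary searches; rebuilding the key list per query would put
# us right back at O(n).
def select_range(ticks, timestamps, start_ts, end_ts):
    lo = bisect_left(timestamps, start_ts)   # first row with ts >= start_ts
    hi = bisect_right(timestamps, end_ts)    # one past last row with ts <= end_ts
    return ticks[lo:hi]

ticks = [(1.0, 100.00), (2.0, 100.25), (3.0, 100.50), (4.0, 100.25)]
keys = [t[0] for t in ticks]                 # built once, reused for every query
print(select_range(ticks, keys, 2.0, 3.0))   # -> [(2.0, 100.25), (3.0, 100.5)]
```

A sort-order-agnostic CSV parser has no choice but to read and test every row; the two `bisect` calls touch O(log n) entries each, which is exactly the I/O saving the post describes.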
Maybe TimescaleDB should be considered. It's a PostgreSQL extension that deals more efficiently with time series (search "TimescaleDB vs. PostgreSQL for time-series").
I checked out the site and I am wondering what practical difference there is between the versions mentioned in the link, from the perspective of a normal retail trader. If I use the free Community version, would that be good enough, or do I need to think about the Enterprise version?
- https://www.timescale.com/products
Anyone else who has tried out TimescaleDB, what is your experience in this regard? Is it good for storing tick data that grows by around 1 million rows each trading session?
Does it have proper analytic functions, etc., which will help in analyzing the data properly?
A lot of guys keep saying that relational databases are not good for storing tick data, but the TimescaleDB guys contest that point of view and say that their database is one of the best ones out there for managing time-series data these days. All this gets me confused...
I cannot keep the data in separate flat files, etc., as some friends have mentioned here. I need to have all this tick data imported into the DB on a daily basis. If anyone is practically doing something similar, please share which particular DB you are using.
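For the daily-import workflow described above, here is a minimal sketch using Python's built-in `sqlite3` as a stand-in (the table name, columns, and sample data are all hypothetical). The same shape - one table, a timestamp index, and one bulk insert per session - carries over to PostgreSQL/TimescaleDB with a different driver:

```python
import sqlite3

# Embedded in-memory DB for the sketch; a real setup would connect to
# a server and persist to disk.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE ticks (
    ts REAL NOT NULL,       -- epoch seconds (illustrative)
    symbol TEXT NOT NULL,
    price REAL NOT NULL,
    size INTEGER NOT NULL)""")
# Composite index so per-symbol time-range queries don't scan the table.
conn.execute("CREATE INDEX idx_ticks_symbol_ts ON ticks (symbol, ts)")

day_of_ticks = [(1.0, "CL", 80.10, 2), (1.5, "CL", 80.11, 1),
                (2.0, "ES", 4100.25, 3)]
with conn:  # one transaction for the whole batch - far faster than row-by-row commits
    conn.executemany("INSERT INTO ticks VALUES (?, ?, ?, ?)", day_of_ticks)

count = conn.execute(
    "SELECT COUNT(*) FROM ticks WHERE symbol = ? AND ts BETWEEN ? AND ?",
    ("CL", 0.0, 2.0)).fetchone()[0]
print(count)  # -> 2
```

At ~1 million rows per session the batch-per-transaction detail matters a great deal; committing each row individually is typically orders of magnitude slower.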
Enough alternatives have been listed in this thread to make a decision. I think it depends on the tool you use to analyze the data.
If you use a third-party tool, the type of DB is not really your choice, because it makes no sense to use anything other than what the tool supports.
If you build your own tool, then you must choose the DB/library that best fits your goals, and here a lot of variables come into play.
To be simplistic, one example would be to use TeaFiles for single data series. To make decisions from multiple data sources, a relational DB could be used, but this could also be achieved in code using single series.
There is no single solution: analyze your goals, then test and measure the tools that best fit them.
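To show why a flat single-series file like the ones mentioned above is so cheap to read, here is a toy sketch. This is not the actual TeaFiles layout (which has its own header and item-description format), just a plain fixed-width binary record done with the stdlib `struct` module:

```python
import struct

# Each record is a fixed 16 bytes: float64 timestamp + float64 price.
# Record i therefore lives at byte offset i * 16, so random access is
# pure arithmetic - no parsing, no index structure needed.
RECORD = struct.Struct("<dd")

def write_series(path, ticks):
    with open(path, "wb") as f:
        for ts, price in ticks:
            f.write(RECORD.pack(ts, price))

def read_record(path, i):
    with open(path, "rb") as f:
        f.seek(i * RECORD.size)          # jump straight to record i
        return RECORD.unpack(f.read(RECORD.size))

write_series("demo.bin", [(1.0, 100.0), (2.0, 100.5), (3.0, 101.0)])
print(read_record("demo.bin", 1))  # -> (2.0, 100.5)
```

Combine this fixed-width layout with sorted timestamps and you get binary-search range queries over a memory-mapped file, which is essentially what the flat-file advocates in this thread are relying on.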