No: I have not taken any steps other than contacting the source code maintainer, and trying to contract a local programmer to build a Linux box with Barchart API connectivity to TREE. Yes: I was trying to build exactly what MXASJ described in his 'Tick Database' initial post.
The programmer I actually met in real life (i.e. a local with many years of experience, ranging from microcode controllers to neural nets) stopped responding to emails and phone calls after I passed on to him the TREE info and the Barchart.com development-tools logins I had acquired for the project. I got discouraged and moved on. Too much to do, not enough time.
The TREE source code provider has contact info on the web page (or I can pass it on to you, along with the only email I exchanged with him). I contacted him to see if there was an API already built for TREE>>Barchart, because I did not want to use the available IB API; as a data source IB is lacking, since not all data is sent. Only the IB API is available.
Long term my plan was:
1. build the TREE server with a cluster or parallel configuration
2. use the data for NT live (less stress on NT [lol]) and the historical data for building a neural net / CUDA
3. exchange data on a 24-hour-delay basis to cover 'fills' with others who might have had disconnects etc.
I want to bring this thread alive again, and I'm going to cheat and move it out of the NinjaTrader section because I think the thread itself could be platform independent.
I'm not a DBA, so I could use some advice. As I wrote in post #2, I previously used NinjaTrader to store tick data in a MySQL database, but it was expensive. I haven't even booted up Ninja in well over a year; in fact I don't even have it installed anymore, so I need a new way to do this.
Right now I have a huge amount of data in a proprietary database (MultiCharts), but I'd like to decentralize it and just have my own private database so I can import and export to it as I please.
At this stage, I am not trying to interface with an API to record incoming ticks live as they come, although that naturally is the ultimate goal. For phase 1, I'd just like to import raw data which I've already got (exported from existing platform) in tick form into a SQL db.
I think the database also needs to be smart enough to handle bid/ask sequencing, which means we need more than just instrument, date, time, last trade, and volume: we need bid and ask plus a tick-id sequence (like DTN IQFeed provides).
An example:
So, who wants to help build this? I'm looking for advice on the db structure, storage engine, layout etc. I'm happy to share the data I've accumulated in exchange (Elite only).
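To make the requirements concrete, here is a minimal MySQL sketch of a tick table with the columns discussed above. All names and types are my assumptions for illustration, not a settled design:

```sql
-- Hypothetical starting point: one row per tick, with bid/ask and a
-- feed-supplied tick id for sequencing (as with DTN IQFeed).
CREATE TABLE tick (
    tick_id     BIGINT UNSIGNED NOT NULL,   -- unique, feed-assigned sequence id
    instrument  VARCHAR(16)     NOT NULL,   -- e.g. 'ES 09-10'
    ts          DATETIME        NOT NULL,   -- date and time of the trade
    price       DECIMAL(18,6)   NOT NULL,   -- last trade price
    volume      INT UNSIGNED    NOT NULL,   -- trade size
    bid         DECIMAL(18,6)   NOT NULL,   -- best bid at trade time
    ask         DECIMAL(18,6)   NOT NULL,   -- best ask at trade time
    PRIMARY KEY (tick_id),
    KEY idx_instrument_ts (instrument, ts)  -- for per-instrument range queries
) ENGINE=InnoDB;
```

The secondary index on (instrument, ts) is what makes "give me all ES ticks between 9:30 and 10:00" cheap; without it every such query is a full table scan.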
Yes Mike, I'm interested in doing this kind of thing.
I can definitely help on the table structure.
Given the comments above, I'm not confident enough to say that an SQL database is definitively a good idea, but I don't see why not, provided the tables are correctly indexed. Either way, it would be easy to run a performance test to see whether it meets expectations.
What documentation method do you use for database schemas? I used to use entity-relationship diagrams that could output SQL, as in Visio, but more recently I just write the schema-creation SQL directly.
You can discover what your enemy fears most by observing the means he uses to frighten you.
Nothing so fancy, just the schema in SQL directly. I guess I will need to see the downsides of MySQL with my own eyes; I really can't imagine that performance wouldn't be more than adequate. The main question for step 1 is just the table structure, and whether one storage engine makes sense over another (MyISAM, InnoDB, etc.).
As far as I remember, InnoDB gives you referential integrity, so you can lock foreign keys down to primary keys that actually exist. I think there are other engines that are better for raw access speed.
In team projects I only ever used InnoDB, because it gives more control to keep the data in the database in order, by preventing anyone (including myself) from stuffing it full of orphan records, e.g. lots of records in the Contract table without a valid Market foreign key.
IIRC, with InnoDB there's an extra indexing statement you have to add to apply the indices; no big deal though.
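For example, InnoDB foreign keys along these lines (table and column names are illustrative, matching the Market/Contract example above) will reject orphan rows at insert time:

```sql
-- Hypothetical Market/Contract pair: InnoDB enforces that every Contract
-- row references a Market row that actually exists.
CREATE TABLE Market (
    market_id  INT UNSIGNED NOT NULL AUTO_INCREMENT,
    name       VARCHAR(64)  NOT NULL,
    PRIMARY KEY (market_id)
) ENGINE=InnoDB;

CREATE TABLE Contract (
    contract_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
    market_id   INT UNSIGNED NOT NULL,
    symbol      VARCHAR(16)  NOT NULL,
    PRIMARY KEY (contract_id),
    KEY idx_market (market_id),  -- index the FK column explicitly
    FOREIGN KEY (market_id) REFERENCES Market (market_id)
) ENGINE=InnoDB;
```

Inserting a Contract row with a market_id that does not exist in Market fails with a constraint error instead of silently creating an orphan.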
So how many tables are you talking? First off, do bid/ask ticks go in the same table as last ticks?
That's a completely FIRST attempt! I guess you'd also want a running volume total in there, if it isn't calculated by the app as you go along, but I'd be tempted to leave it out, since a lot of charts won't be interested in the time frame used to calculate the current total and would recalculate it anyway. That's just a guess; I've never worked with volume.
It looks like the tick id is going to be your primary key value, as long as you can guarantee the tick id is always unique.
Extra question: date/time. I would put the timestamp with date and time in one column, since that's what the DATETIME data type is designed for. It might be better to stick with separate date and time fields, though, if most software wants them that way. I don't know whether it is possible to merge two separate fields at import; it might require a separate SQL operation afterwards.
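If the export does come with separate date and time text columns, one way to merge them after the import is a single UPDATE (a sketch, with hypothetical column names and assuming a yyyymmdd / hh:mm:ss export format):

```sql
-- Hypothetical post-import step: combine separate date and time text
-- fields into one DATETIME column.
UPDATE tick
SET ts = STR_TO_DATE(CONCAT(trade_date, ' ', trade_time),
                     '%Y%m%d %H:%i:%s');
```

The format string has to match however the platform exported the fields, so it would need adjusting for each data source.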
I tried doing that for some months now on the side (besides a freaking 10-hour-per-day project) and I am giving up; it makes zero sense, at least for me. Storing level 2 book ticks with all bid/ask changes for the complete CME group (around 600 million rows per day) is not something I think can be done effectively. Just to give you an idea, my current hardware is a dedicated SQL Server with 16 GB RAM and 4 cores assigned (a virtual machine, but I control the underlying hardware), with the storage consisting of 8 VelociRaptors and one SSD.
The overhead is just too much.
I am now moving towards storing binary blobs for one base symbol + instrument type (example: YM futures) for a specific interval (one hour), and trying to use highly efficient extraction methods. This is a LOT more efficient in terms of storage.
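The blob approach described above might look something like this (a sketch under my own assumptions about names; the packing format inside the blob is entirely application-defined):

```sql
-- Hypothetical blob table: one row holds a packed binary block of all
-- ticks for one symbol over one hour; the application encodes/decodes it.
CREATE TABLE tick_blob (
    symbol      VARCHAR(16) NOT NULL,  -- base symbol + type, e.g. 'YM-FUT'
    hour_start  DATETIME    NOT NULL,  -- start of the one-hour interval
    ticks       LONGBLOB    NOT NULL,  -- packed tick records
    PRIMARY KEY (symbol, hour_start)
);
```

The trade-off is that the database can no longer query inside a blob; you fetch whole hours and filter in the application, in exchange for far fewer rows and much less per-row overhead.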
You were storing in real time, I assume? That is not required for me in phase 1; I am just importing.
If the issue was table scans, perhaps a separate database per instrument instead of one gigantic db? Again, I'm not a DBA, which is why I'm looking for input.
The server is a dual Xeon 5430 with 24 GB. I'm looking for input on what the issue was on your side, and how many instruments you were archiving. I plan to archive only about a dozen instruments; perhaps that is the major difference between us.
Another consideration is that MySQL timestamps do not support millisecond resolution. If millisecond resolution is needed, you cannot use a Timestamp field.
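One common workaround for the missing millisecond resolution is to store the timestamp as an integer count of milliseconds since the Unix epoch (a sketch, with hypothetical column names):

```sql
-- Hypothetical: keep millisecond precision in a BIGINT instead of a
-- TIMESTAMP/DATETIME column.
CREATE TABLE tick_ms (
    tick_id  BIGINT UNSIGNED NOT NULL,
    ts_ms    BIGINT          NOT NULL,  -- epoch milliseconds
    price    DECIMAL(18,6)   NOT NULL,
    PRIMARY KEY (tick_id)
);

-- On the way out, FROM_UNIXTIME(FLOOR(ts_ms / 1000)) recovers a
-- second-resolution DATETIME; the remainder (ts_ms % 1000) is the
-- millisecond part, handled in the application.
```

A BIGINT also sorts and indexes cheaply, which matters at tick-data row counts.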
I remember that on SQL Server you can use table partitioning and automatically split your tables across disks.
Also remember data rows are stored on disk in the order of the clustered index, so it must be chosen wisely, depending on your queries.
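MySQL has range partitioning as well; a sketch for splitting the tick table by day might look like this (names and dates are illustrative, and note that MySQL requires the partitioning column to be part of every unique key):

```sql
-- Hypothetical: range-partition ticks by trade day, so date-bounded
-- queries and old-data purges touch only the relevant partitions.
CREATE TABLE tick_part (
    tick_id    BIGINT UNSIGNED NOT NULL,
    trade_day  DATE            NOT NULL,
    price      DECIMAL(18,6)   NOT NULL,
    PRIMARY KEY (tick_id, trade_day)  -- partition column must be in the PK
)
PARTITION BY RANGE (TO_DAYS(trade_day)) (
    PARTITION p20100601 VALUES LESS THAN (TO_DAYS('2010-06-02')),
    PARTITION p20100602 VALUES LESS THAN (TO_DAYS('2010-06-03')),
    PARTITION pmax      VALUES LESS THAN MAXVALUE
);
```

Dropping a whole day then becomes ALTER TABLE ... DROP PARTITION, which is nearly instant compared to a DELETE over millions of rows.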
To add to the performance discussion: I'm not sure SQL is the way to go; it just eats up too much space if you use the basic data types. You end up implementing your own binary database inside an SQL database, like NetTecture, and then you're using the SQL server more or less as a file system.
You could try the easy solution first (brute-force inserts into SQL; I don't know which SQL server you will use, but I'm sure they all have some sort of bulk copy for massive inserts), and if you hit performance issues you can always move to a more optimized solution.
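In MySQL the bulk-copy route is LOAD DATA INFILE, which is typically far faster than row-by-row INSERTs. A sketch, assuming a comma-separated export with a header row (the path and column names are hypothetical):

```sql
-- Hypothetical bulk import of a platform's CSV tick export.
LOAD DATA INFILE '/data/export/ES_ticks.csv'
INTO TABLE tick
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(trade_date, trade_time, price, volume, bid, ask);
```

For very large imports it also helps to create secondary indexes after the load rather than before, so the engine is not maintaining them on every row.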