I am considering replicating something like what @Big Mike did using MariaDB and TokuDB, or possibly going with MongoDB. I have very little idea about what would fit my needs. Big Mike's setup, as shown in his thread, handles something like 6,000 instruments, which is definite overkill for me at this point.
So let's get my requirements down. I am going to be collecting tick data for about 500 instruments, maybe 1,000 after some development. We only have 5 levels of depth, not 10, so that should save some space; some instruments are level 1 only. I will also save minute and daily data.
I have a decently sized budget of around $5k for this, not including a 1.5-year-old ThinkServer with a Xeon E3-1200v3 that I am going to repurpose for this.
How many HDDs do I need? RAID 5 or 10?
Is going SSD vs HDD worth it for saving tick data for 500 instruments?
How would I set that up? One hard drive for the OS, one for the logs, and the RAID array for storage?
I am also going to guess that Linux (Ubuntu) is going to be better than Windows.
How much storage space should I look to buy?
How important is RAM? The ThinkServer only has 8 GB at this point.
Sorry, I really have no idea about anything here. I could really use some help.
Thanks in advance
On SSD vs HDD: it depends on the type of queries that you are going to run. My answer is most probably not.
Depends on the memory-intensiveness (caching mechanisms, index operations) of the database that you're using. Without knowing more details, my gut feeling is that your database application will probably need 32-64 GB.
You've answered this yourself.
Probably give yourself around 20 MB x 500 instruments per day = 10 GB per day, so roughly 5 TB to last 2 years (~500 trading days). Your budget is plenty.
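For anyone who wants to plug in their own numbers, here's a quick back-of-envelope sketch; the ~20 MB/instrument/day figure comes from the estimate above, and the 250-trading-days-per-year count is an assumption:

```python
# Back-of-envelope storage sizing for tick collection.
# Assumptions: ~20 MB/day per instrument (from the estimate above),
# ~250 trading days per year.
MB_PER_INSTRUMENT_DAY = 20
INSTRUMENTS = 500
TRADING_DAYS_PER_YEAR = 250
YEARS = 2

daily_gb = MB_PER_INSTRUMENT_DAY * INSTRUMENTS / 1000        # ~10 GB/day
total_tb = daily_gb * TRADING_DAYS_PER_YEAR * YEARS / 1000   # ~5 TB

print(f"~{daily_gb:.0f} GB/day, ~{total_tb:.0f} TB over {YEARS} years")
```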
RAID 10 vs 5 is a business decision first: How much are you willing to pay for faster performance? 10 is generally 'faster' than 5, at the trade-off of space. You can easily get 5 TB with 3 disks in a RAID 5 or 4 disks in a RAID 10, so it's up to you.
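To make the capacity side of that trade-off concrete, a small sketch (the 3 TB disk size is an arbitrary assumption):

```python
# Usable capacity for simple RAID layouts (ignores filesystem overhead).
def usable_tb(raid_level: int, disks: int, disk_tb: float) -> float:
    if raid_level == 5:
        return (disks - 1) * disk_tb  # one disk's worth of capacity goes to parity
    if raid_level == 10:
        return disks // 2 * disk_tb   # mirrored pairs halve the raw capacity
    raise ValueError("only RAID 5 and RAID 10 handled here")

print(usable_tb(5, 3, 3.0))   # 3x 3 TB in RAID 5  -> 6.0 TB usable
print(usable_tb(10, 4, 3.0))  # 4x 3 TB in RAID 10 -> 6.0 TB usable
```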
Make sure you get a server with enough bays for all the disks you need and extras (you will care about this later).
What logs? Database WAL/xlog? Depends on the database you're using - but most probably yes, splitting the drives would help speed it up.
Linux vs Windows: it depends on the rest of your ecosystem. If you're using something like MS SQL Server, then Windows has first-class support. If the rest of your machines are in a Windows environment (Active Directory, RDC, etc.), then it's probably convenient to use Windows. Otherwise, Linux.
There are too many unknown parameters, IMVHO, to give good answers, so the first thing to do might be to choose the database type: SQL or NoSQL.
You can try to think about how you're going to get your data back out of a NoSQL database; that may help you see whether it's a good idea or not.
Collecting 5 levels of depth from 500 instruments will require very good write throughput: forget about a couple of HDDs in RAID 5, unless perhaps you have a very good RAID card with at least 2 GB of cache.
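For a rough sense of the write load involved, here's a sketch; the message rate and record size are illustrative assumptions, not measurements:

```python
# Rough write-throughput estimate for 5-level depth collection.
# Assumptions (illustrative only): ~50 book updates/sec per instrument
# during busy periods, ~100 bytes per stored depth record.
INSTRUMENTS = 500
UPDATES_PER_SEC = 50
BYTES_PER_RECORD = 100

mb_per_sec = INSTRUMENTS * UPDATES_PER_SEC * BYTES_PER_RECORD / 1e6
print(f"~{mb_per_sec:.1f} MB/s sustained writes")  # ~2.5 MB/s raw

# The raw MB/s looks small, but it arrives as a stream of tiny writes
# (plus index updates), so random-write IOPS is what actually hurts
# spinning disks, not sequential bandwidth.
```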
I'd use SSDs in RAID 0 (or RAID 10) for live collection, then archive the collected data on cheaper/bigger HDDs in RAID 5, as write performance won't be a problem in that case.
About the RAM: the more the better, but the Xeon E3s are limited to 32 GB.
You don't want to collect every depth-of-market change, only the depth-of-market status after a trade has occurred, correct?
Optimally I would collect every depth-of-market change, so that I could estimate order cancellation and modification rates. If I only do it after each trade, then it would basically just be a random snapshot.
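A minimal sketch of the difference, assuming a hypothetical feed callback that tags each book event with its type (all names here are made up for illustration):

```python
# Sketch: capturing every depth change vs. only post-trade snapshots.
from collections import Counter

events = []       # stand-in for the real storage layer
stats = Counter()

def on_book_event(event: dict) -> None:
    """Hypothetical feed callback, invoked once per book event."""
    stats[event["type"]] += 1   # e.g. "add", "cancel", "modify", "trade"
    events.append(event)        # persist every change
    # A post-trade-only collector would instead snapshot the book when
    # event["type"] == "trade", losing the cancel/modify activity
    # that happens between trades.

def cancellation_rate() -> float:
    return stats["cancel"] / max(1, sum(stats.values()))

on_book_event({"type": "cancel", "symbol": "ESZ5", "level": 2})
print(cancellation_rate())      # 1.0 with this single event
```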
Just to help fix some of the parameters: let's say I am going with the exact same setup as @Big Mike, using MariaDB and TokuDB. That will make it more familiar and probably just easier to use, as SQL is more commonplace.
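For concreteness, here's one possible shape such a setup could take. The schema, names, and connection details below are illustrative assumptions, not Big Mike's actual layout:

```python
# Sketch: a possible tick table under MariaDB + TokuDB, via pymysql.
import pymysql

DDL = """
CREATE TABLE IF NOT EXISTS ticks (
    symbol VARCHAR(16)   NOT NULL,
    ts     DATETIME(6)   NOT NULL,  -- microsecond precision
    price  DECIMAL(18,8) NOT NULL,
    size   INT UNSIGNED  NOT NULL,
    PRIMARY KEY (symbol, ts)        -- clustered on the main query pattern;
                                    -- a real feed may need a sequence column
                                    -- to break timestamp ties
) ENGINE=TokuDB ROW_FORMAT=TOKUDB_ZLIB  -- TokuDB's on-disk compression
"""

conn = pymysql.connect(host="localhost", user="ticks",
                       password="...", database="market")
with conn.cursor() as cur:
    cur.execute(DDL)
conn.commit()
```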
I really appreciate this answer. It helped me get some basic estimates and a sense of what I should be looking for, as well as giving me a frame of reference. Thanks as always @artemiso
Storing data is only a small piece. Using it is more complex (timely queries).
My dual Xeon E5 with 128 GB of memory and RAID 10 SSDs on 6 Gb/s SAS hardware RAID is barely up to the task. I would advise you to do months of proof-of-concept testing on a small scale before deciding what you really need in a larger deployment.
You make a really good point about testing on a small scale, and usage (timely queries) is very important.
Is the upgrade from, say, 10k RPM HDDs to SSDs worth it? I know that you get about 10x the IOPS at about 10x the price. Also, enterprise-grade hardware is significantly more expensive. Is that worth it?
Using MariaDB/TokuDB, what tweaks or hardware do you think improve timely queries the most?
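For reference, the kind of "timely query" in question would presumably look something like this (assuming the hypothetical ticks table sketched earlier, with its (symbol, ts) primary key):

```python
# Sketch: a typical time-range query against the hypothetical ticks table.
import pymysql

QUERY = """
SELECT ts, price, size
FROM ticks
WHERE symbol = %s AND ts BETWEEN %s AND %s
ORDER BY ts
"""

conn = pymysql.connect(host="localhost", user="ticks",
                       password="...", database="market")
with conn.cursor() as cur:
    # With (symbol, ts) as the clustering key, this runs as one
    # sequential range scan instead of scattered point lookups.
    cur.execute(QUERY, ("ESZ5", "2015-10-01 09:30:00", "2015-10-01 16:00:00"))
    rows = cur.fetchall()
```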
Enterprise hardware is almost certainly desired if (i) you are going to plug it into a remote datacenter and leave it alone and inaccessible for a very long time, (ii) you have a lot of hardware working together in concert, and (iii) you have a lot of data to write and re-write. (There's a fourth reason why enterprise hardware is expensive that I will talk about at the end, which will probably be irrelevant for this discussion.)
It's a matter of probability. Hardware has a certain mean time before failure and standard deviation of time before failure; enterprise hardware generally has a larger mean and a smaller standard deviation. Even the cheapest retail hardware will probably give you something like 57 years mean time before failure (about 500,000 hours). You can quantify whether it matters to you: say, 4 disks in a server in your closet vs 400 disks at 8 hours' travel time away. For most people, enterprise hardware is not necessary.
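To put rough numbers on that comparison, a small sketch assuming (simplistically) independent, exponentially distributed failure times:

```python
# Probability of at least one disk failure within a year, assuming
# independent exponential failure times (a simplification).
import math

def p_any_failure(n_disks: int, years: float, mtbf_years: float) -> float:
    return 1 - math.exp(-n_disks * years / mtbf_years)

# 4 disks in a closet vs. 400 disks 8 hours away, 57-year MTBF each:
print(f"{p_any_failure(4,   1, 57):.1%}")   # ~ 6.8% chance per year
print(f"{p_any_failure(400, 1, 57):.1%}")   # ~99.9% chance per year
```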
For example, I'm in a large city where it is costly to put (a very large number of) servers within driving distance of the city. Instead, most of my firepower is in an obscure city that I've only been to once in my lifetime. One day of downtime and sending a sysadmin down to fix things is a lot costlier than the additional cost of enterprise hardware, so the price becomes justified immediately. In another case, I have servers about 10k miles away from my location; there, it's an absolute certainty that you'd rather spend a bit more on a better insurance policy (enterprise hardware).
(The fourth reason that enterprise hardware is expensive - and in this case I mean EMC etc. - is that it gets very, very expensive to extract acceptable performance after your first 1 petabyte.)
Regarding RAID5 vs RAID10, I was always under the impression that RAID 10 is also the "safer" option. Couldn't really find the article I was looking for, but here is something similar - RAID 5 Vs. RAID 10. With RAID5, you run the risk that you lose your entire dataset during a rebuild. With RAID 10, you still have a risk when 1 disk is being "cloned" after a failure, but that is a read-only operation (on the surviving drive) which places less stress on a disk vs the full read-write operation of RAID5 rebuild. Bear in mind though that the last time I investigated this was 2010/2011 and I have used RAID5 (actually RAIDZ1) since then without issue in my iTunes server, but it runs a much lighter workload than your proposed database.
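The usual way to quantify that rebuild risk is the chance of hitting an unrecoverable read error (URE) while re-reading the surviving disks. A rough sketch, assuming the commonly quoted consumer-drive URE rate of one per 10^14 bits:

```python
# Rough odds of an unrecoverable read error (URE) during an array rebuild.
# Assumes the commonly quoted consumer-drive URE rate of 1 per 1e14 bits.
URE_PER_BIT = 1e-14

def p_ure_during_rebuild(disks_reread: int, disk_tb: float) -> float:
    bits_read = disks_reread * disk_tb * 1e12 * 8
    return 1 - (1 - URE_PER_BIT) ** bits_read

# RAID 5 rebuild re-reads every surviving disk (here 2x 3 TB):
print(f"{p_ure_during_rebuild(2, 3.0):.0%}")   # ~38%
# A RAID 10 rebuild only re-reads the failed disk's mirror partner:
print(f"{p_ure_during_rebuild(1, 3.0):.0%}")   # ~21%
```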
You also need to budget for a backup solution. RAID arrays can fail and your data can be lost. If you want to have snapshots (like Apple's Time Machine), then it would be good to have more storage space in your backup solution than in your database server; I would aim for about double the size in this scenario. If you do not need snapshots, then the same size is fine.
Also bear in mind that the less free space you have in your array, the more stress the workload puts on your drives. In my server there is a very definite drop in performance once free space drops below 50%, and once it drops to 20% things slow to a crawl. So always ensure that you budget for sufficient size now. While you can increase the size later with RAID 5 (not sure about 10), every time you add drives the array rebuilds itself, which means you run a higher risk of failure.
Seems mostly right. I'd also add that RAID 10 gives you the ability to lose up to half of your drives in the best-case scenario, as opposed to only one drive with RAID 5.