mikabytes

Reputation: 2165

Database choice: High-write, low-read

I'm building a component for recording historical data. Initially I expect it to do about 30 writes/second, and less than 1 read/second.

The data will never be modified; only new data will be added. Reads are likely to be done on fresh records.

The demand is likely to increase rapidly; I'm expecting around 80 writes/second in one year's time.

I could choose to distribute my component and use a common database such as MySQL, or I could go with a distributed database such as MongoDB. Either way, I'd like the database to handle writes very well.

The database must be free. Open source would be a plus :-)

Note: A record is plain text in variable size, typically 50 to 500 words.

Upvotes: 10

Views: 11143

Answers (2)

anon

Reputation:

Your question can be solved a few different ways, so let's break it down and look at the individual requirements you've laid out:

  1. Writes - It sounds like the bulk of what you're doing is append-only writes at a relatively low volume (80 writes/second). Just about any product on the market with a reasonable storage backend is going to be able to handle this. You're looking at 50-500 "words" of data being saved. I'm not sure what constitutes a word, but for the sake of argument let's assume a word averages 8 characters, so each record is some kind of metadata, a key/timestamp/whatever, plus 400-4,000 bytes of words. Barring implementation-specific details of different RDBMSes, this is still pretty normal: we're probably writing at most (including record overhead) 4,100 bytes per record. That maxes out at 328,000 bytes per second or, as I like to put it, not a lot of writing.

  2. Deletes - You also need the ability to delete your data. There's not a lot I can say about that. Deletes are deletes.

  3. Reading - Here's where things get tricky. You mention that it's mostly primary keys and reads are being done on fresh data. I'm not sure what either of these mean, but I don't think that it matters. If you're doing key only lookups (e.g. I want record 8675309), then life is good and you can use just about anything.

  4. Joins - If you need the ability to write actual joins where the database handles them, you've written yourself out of the major non-relational database products.

  5. Data size/Data life - This is where things get fun. You've estimated your writes at 80/second, and I've guessed at 4,100 bytes per record, or 328,000 bytes per second. There are 86,400 seconds in a day, which gives us 28,339,200,000 bytes. Terrifying! That's 27,675,000 KB, 27,026 MB, or roughly 26 GB per day. Even if you keep your data for just 1 year, that's 9,633 GB, or roughly 10 TB of data. You can lease 1 TB of storage from a cloud hosting provider for around $250 per month or buy it from a SAN vendor like EqualLogic for about $15,000.
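The arithmetic above is easy to sanity-check. A minimal sketch, using the same assumptions as the answer (8 characters per word, a 100-byte per-record overhead on top of 500 words, 80 writes/second sustained, 365-day retention):

```python
# Back-of-the-envelope check of the numbers above.
# Assumptions: 8 chars/word, 100 bytes of record overhead
# (key, timestamp, row header, etc.), 80 writes/second, 1-year retention.

WORDS_MAX = 500
BYTES_PER_WORD = 8
OVERHEAD = 100
WRITES_PER_SEC = 80

record_bytes = WORDS_MAX * BYTES_PER_WORD + OVERHEAD   # 4,100 bytes/record
bytes_per_sec = record_bytes * WRITES_PER_SEC          # 328,000 bytes/second
bytes_per_day = bytes_per_sec * 86_400                 # 28,339,200,000 bytes/day
gb_per_day = bytes_per_day / 1024**3
tb_per_year = bytes_per_day * 365 / 1024**4

print(f"{bytes_per_sec:,} bytes/second")
print(f"{gb_per_day:.1f} GB/day, {tb_per_year:.1f} TB/year")  # ~26.4 GB/day, ~9.4 TB/year
```

So the worst case lands at roughly 26 GB per day and just under 10 TB per year, matching the figures in the answer.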

Conclusion: I can only think of a few databases that couldn't handle this load. 10TB is getting a bit tricky and requires a bit of administration skill, and you might need to look at certain data lifecycle management techniques, but almost any RDBMS should be up to this task. Likewise, almost any non-relational/NoSQL database should be up to this task. In fact, almost any database of any sort should be up to the task.

If you (or your team members) already have skills in a particular product, just stick with that. If there's a specific product that excels in your problem domain, use that.

This isn't the type of problem that requires any kind of distributed magical unicorn powder.

Upvotes: 16

Rolice
Rolice

Reputation: 3103

Ok, for MySQL I would advise you to use InnoDB without any indexes except on the primary key. Even then, if you can skip it, that would be good, in order to keep the input flow uninterrupted.

Indexes optimize reading, but they decrease write performance.

You may also use PostgreSQL, where you need to skip indexes as well. You won't have a storage-engine selection, but its write capabilities are also very strong.

The approach you want is actually used in some solutions, but with two database servers, or at least two databases. The first receives a lot of new data (your case), while the second pulls from the first and stores the data in a well-structured database (with indexes, rules, etc.). Then, when you need to read or take a snapshot of the data, you query the second server (or second database), where you can use transactions and so on.
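A minimal sketch of that two-database pattern, using Python's standard-library sqlite3 purely for illustration (the table and column names are hypothetical, and in production the two connections would point at separate servers rather than in-memory databases):

```python
# Two-database pattern: an index-free ingest store for fast appends,
# and a separately indexed serving store for the (rare) reads.
import sqlite3

ingest = sqlite3.connect(":memory:")   # write-optimized: no secondary indexes
serving = sqlite3.connect(":memory:")  # read-optimized: indexed

ingest.execute("CREATE TABLE raw_log (id INTEGER PRIMARY KEY, ts TEXT, body TEXT)")
serving.execute("CREATE TABLE log (id INTEGER PRIMARY KEY, ts TEXT, body TEXT)")
serving.execute("CREATE INDEX idx_log_ts ON log (ts)")  # indexes live only here

# Fast append-only writes hit the ingest database.
ingest.executemany("INSERT INTO raw_log (ts, body) VALUES (?, ?)",
                   [("2013-01-01T00:00:00", "some record text"),
                    ("2013-01-01T00:00:01", "another record")])
ingest.commit()

# A periodic job copies batches into the indexed serving database.
rows = ingest.execute("SELECT id, ts, body FROM raw_log").fetchall()
serving.executemany("INSERT INTO log (id, ts, body) VALUES (?, ?, ?)", rows)
serving.commit()

# Reads query the serving database, where indexes and transactions help.
count = serving.execute("SELECT COUNT(*) FROM log").fetchone()[0]
print(count)  # 2
```

The design choice is simply that write amplification from index maintenance is paid asynchronously, in batches, instead of on every insert.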

For a more detailed picture, you could also take a look at Oracle Express (I think that was its name) and SQL Server Express Edition. Both have better performance, but also some limitations.

Upvotes: -1
