Christian Fazzini

Reputation: 19723

What solutions/patterns can I consider for storing millions of raw data points?

Looking for opinions on storing raw data. The kind of data that falls under the category of "track anything and everything". Mainly used for in-house analytics to drive direction, test new features, etc.

Storing data is pretty easy. Just dump it into log files, no need for a database. On the other hand, if you want to perform complex analysis and data mining on it, then a database is helpful.

So I was thinking of storing raw data in Redis. Redis is fast at writes, not ideal for archiving, but perfect for ephemeral data. I could write to Redis, then archive the result set for later analysis if need be.
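Roughly what I have in mind for the write path, as a sketch (the list key and event fields are just placeholders I made up):

```python
# Minimal sketch of the write path, assuming redis-py and a single
# Redis list ("raw_events") acting as an append-only buffer.
import json
import time

import redis

r = redis.Redis(host="localhost", port=6379)

def track(event_type, payload):
    """Push one raw event onto the Redis list; cheap and fast."""
    event = {"type": event_type, "ts": time.time(), "data": payload}
    r.rpush("raw_events", json.dumps(event))

track("page_view", {"user_id": 42, "path": "/pricing"})
```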

When it comes down to aggregating into a more readable/grouped format, an RDBMS like Postgres would suffice. However, I was thinking of using MongoDB's document structure instead, which seems well suited to reads, with the addition of its aggregation framework.
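For the read side, this is a sketch of the kind of aggregation I mean, against a hypothetical events collection (the database, collection, and field names are placeholders):

```python
# Sketch of a MongoDB aggregation over an "events" collection,
# counting raw events by type. Field names and the date are assumptions.
from datetime import datetime

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
events = client["analytics"]["events"]

pipeline = [
    {"$match": {"ts": {"$gte": datetime(2013, 1, 1)}}},
    {"$group": {"_id": "$type", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
]

for row in events.aggregate(pipeline):
    print(row)
```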

I could aggregate the raw data from Redis in batches, perhaps periodically in a cron job or worker process.
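The batch step itself could be a small worker along these lines (again just a sketch; the key, collection, and batch size are made up). The nice part is that the Redis side never blocks on Mongo; if the worker falls behind, events simply queue up in the list.

```python
# Sketch of a periodic worker (cron or loop) that drains the Redis
# buffer in batches and bulk-inserts the raw events into MongoDB.
# Key name, collection, and batch size are assumptions.
import json

import redis
from pymongo import MongoClient

r = redis.Redis()
events = MongoClient()["analytics"]["events"]

BATCH = 1000

def drain_once():
    batch = []
    for _ in range(BATCH):
        raw = r.lpop("raw_events")
        if raw is None:
            break
        batch.append(json.loads(raw))
    if batch:
        events.insert_many(batch)
    return len(batch)

if __name__ == "__main__":
    while drain_once() == BATCH:
        pass  # keep draining until the buffer is (nearly) empty
```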

So this is one example. I am pretty keen on MongoDB for the aggregation part. What other setups/solutions can I consider for storing millions of raw data records? What are some of the best practices around this?

Upvotes: 0

Views: 1070

Answers (2)

Eli

Reputation: 38949

Storing data is pretty easy. Just dump it into log files, no need for a database. On the other hand, if you want to perform complex analysis and data mining on it, then a database is helpful.

This is partially true. Databases are definitely nice, but if you want to do heavy analytical queries on big data, Hadoop is also a really good option (Pig or Hive make this pretty easy to do). I've played around with Mongo's aggregation framework and didn't like it nearly as much as using Pig/Hive over Hadoop. It also doesn't have nearly as big a user network.

So, the answer here heavily depends on your use case. What kind of analysis do you want to do in (semi) real time, and what kind of analysis do you want to do later on in batches or manually?

Based on your post, it sounds like you mostly want to do analysis later on, on a case-by-case basis. For this, I would 100% use a logging framework like Kafka or Fluentd to grab data as it's coming in and stream it to different places. Those frameworks both provide parallelism and redundancy for moving your data as it comes in.
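As a rough sketch of what the ingest side could look like with kafka-python (the broker address and topic name are assumptions, not anything you've mentioned):

```python
# Sketch of the ingest side: every tracked event is published to a
# "raw-events" Kafka topic. Kafka then fans the stream out to whichever
# sinks you attach. Broker address and topic name are assumptions.
import json

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def track(event):
    producer.send("raw-events", event)

track({"type": "page_view", "user_id": 42, "path": "/pricing"})
producer.flush()
```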

For sinks, I'd use HDFS or S3 for cold storage for later batch processing. With both, you get redundancy and the ability to run Hadoop over them directly. For real time processing, if you need it, I'd use Storm. You can also always add extra sinks to Mongo, for example, if you want to store to a database as well. That's one of the nicest things about logging frameworks: you can always add more sinks and more machines pretty easily.
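A cold-storage sink can be as simple as a consumer that flushes batches of raw events to S3, roughly like this (bucket name, topic, and batch size are assumptions, and boto3 is expected to pick up credentials from the environment):

```python
# Sketch of a cold-storage sink: a consumer reads from the same topic
# and periodically flushes batches to S3 as newline-delimited JSON,
# ready for later Hadoop/batch processing.
import time

import boto3
from kafka import KafkaConsumer

s3 = boto3.client("s3")
consumer = KafkaConsumer("raw-events", bootstrap_servers="localhost:9092")

batch, BATCH_SIZE = [], 10000
for msg in consumer:
    batch.append(msg.value.decode("utf-8"))
    if len(batch) >= BATCH_SIZE:
        key = "raw/%d.json" % int(time.time())
        s3.put_object(Bucket="my-analytics-archive", Key=key,
                      Body="\n".join(batch).encode("utf-8"))
        batch = []
```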

Redis has lots of great use cases-- especially if you want a cache, or really fast operations on simple data structures. Unless I misunderstand what you want, I don't see Redis being particularly helpful here.

Mongo might well be useful to you if, aside from analytical queries that do aggregation of some sort, you also want to query for specific items or run queries pretty quickly (expect Hadoop queries to take no less than 30 seconds, even for something simple). In that case, like I mentioned, you just add an extra sink for Mongo. You can always add this later if you're not sure you need it.
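For example, a point lookup like this is the kind of thing Mongo answers in milliseconds, where a Hadoop job would take far longer (the collection, index, and field names here are assumptions):

```python
# Sketch of a point lookup against an "events" collection fed by an
# extra Mongo sink, indexed on user_id. Names are assumptions.
from pymongo import MongoClient

events = MongoClient()["analytics"]["events"]
events.create_index("user_id")

# Latest 20 events for one user.
recent = events.find({"user_id": 42}).sort("ts", -1).limit(20)
for e in recent:
    print(e)
```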

Upvotes: 1

WeeniehuahuaXD

Reputation: 852

I would consider using a NoSQL option such as MongoDB. In my opinion it is incredibly scalable and efficient. I have personally worked on projects of this scale and each time have preferred a NoSQL solution such as MongoDB.

Upvotes: 0
