Reputation: 1882
I'm relatively new to Cassandra and have to evaluate different NoSQL solutions for a monitoring tool. A single record is only about 100 bytes, but there are a lot of them: we get about 15 million records per day. I'm currently testing with 900 million records (about 15 GB as an SQL insert script).
My first question is: does Cassandra fit my needs? I need to do range queries (on the date the records were created) and sum up some of the columns by groups defined by "secondary indexes" stored in each record.
I already tried MongoDB, but its MapReduce performed really poorly... I also read about HBase, but the enormous amount of configuration it needs makes me hope there is a solution with Cassandra...
My second question is: how should I store my data so I can access it in the ways mentioned above? I thought of a super column family where the key is the date (as a long, since 1970) and the columns are the records taken at that time. But if I use the Random Partitioner, I can't do fast range queries on it (as far as I know), and if I use the Order Preserving Partitioner, the data won't be spread over my cluster (currently consisting of two nodes).
I hope I gave you all the necessary information... Thank you for your help!
andy
Upvotes: 2
Views: 417
Reputation: 629
We had a similar situation.
We store our data in simple rows, where the row key is in the form <id>:<time-bucket>. Our current time bucket size is 24h. The column is the timestamp, and the value is a small object serialized with msgpack.
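A rough sketch of how such row keys could be built, and how a date range maps to a known set of row keys that can be fetched directly without depending on key ordering in the partitioner. The function names and the choice of the bucket's start timestamp as the `<time-bucket>` part are assumptions for illustration, not our actual code:

```python
from typing import List

BUCKET_SECONDS = 24 * 60 * 60  # 24h buckets, as described above


def bucket_start(ts: int) -> int:
    """Round a Unix timestamp (seconds) down to the start of its bucket."""
    return ts - ts % BUCKET_SECONDS


def row_key(source_id: str, ts: int) -> str:
    """Build an '<id>:<time-bucket>' key (bucket = its start timestamp, an assumption)."""
    return f"{source_id}:{bucket_start(ts)}"


def row_keys_for_range(source_id: str, start_ts: int, end_ts: int) -> List[str]:
    """List every row key whose bucket overlaps [start_ts, end_ts]."""
    keys = []
    bucket = bucket_start(start_ts)
    while bucket <= end_ts:
        keys.append(f"{source_id}:{bucket}")
        bucket += BUCKET_SECONDS
    return keys
```

Because the keys for a range can be computed up front, a range query becomes a multi-row fetch of those keys plus a column slice inside the first and last bucket.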
We do aggregation manually if needed.
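Manual aggregation on the client could look roughly like the sketch below. It assumes the columns of one row have already been fetched into a dict of timestamp to serialized value; `json` from the standard library stands in for msgpack so the example runs without extra packages, and the field names passed in are hypothetical:

```python
import json
from collections import defaultdict


def sum_by_group(columns, start_ts, end_ts, group_field, value_field):
    """Sum value_field per group_field over columns with start_ts <= ts <= end_ts.

    columns maps timestamp -> serialized record (JSON here, msgpack in practice).
    """
    totals = defaultdict(float)
    for ts, blob in columns.items():
        if start_ts <= ts <= end_ts:
            record = json.loads(blob)
            totals[record[group_field]] += record[value_field]
    return dict(totals)
```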
We also do a small optimization: when the bucket is full, it becomes immutable, so we create an "all" object holding all values in a single column. Then the per-timestamp columns can be purged. This allows us to fetch a whole bucket with a single column read and deserialize it in one step, rather than scanning through the row.
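The idea of compacting a full bucket can be sketched as follows. A row is modeled as a plain dict of timestamp to value, `json` again stands in for msgpack, and the column name `"all"` plus both function names are assumptions for illustration only:

```python
import json

ALL_COLUMN = "all"  # hypothetical name of the single compacted column


def compact_bucket(row):
    """Replace the per-timestamp columns of a full bucket with one 'all' column."""
    timestamps = sorted(k for k in row if k != ALL_COLUMN)
    # JSON object keys must be strings, so timestamps are stringified here.
    values = {str(ts): row[ts] for ts in timestamps}
    return {ALL_COLUMN: json.dumps(values)}


def read_bucket(row):
    """Return timestamp -> value, whether or not the bucket was compacted."""
    if ALL_COLUMN in row:
        return {int(ts): v for ts, v in json.loads(row[ALL_COLUMN]).items()}
    return dict(row)
```

Readers only need to check for the compacted column first, so buckets that are still open and buckets that are already closed can be read through the same code path.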
Upvotes: 0
Reputation: 19377
Sounds like a job for Brisk (Cassandra + Hadoop distribution). Full Hadoop map/reduce including Hive support, virtually no configuration required.
http://www.datastax.com/products/brisk
Upvotes: 3