Reputation: 11
I am currently running a project where we need to store roughly 40 billion documents (PDF, TIFF) per year for roughly 200 million accounts, and I was wondering whether it is possible to use Cassandra for that. The main attraction is the scalability, stability and multiple-datacenter support in the Cassandra design.
But I wonder if it is a good idea to use Cassandra for this at all - or would an alternative like CouchDB be a better option?
Just a note: we don't need full-text search within the documents, and each document will only have a limited amount of metadata attached - date, time, origin, owner and a unique id, plus a few keywords. Access to documents will normally start with a query on owner id, and from there the needed document is picked out by origin and optionally date/time. So nothing fancy.
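To make the access pattern concrete, something along these lines is what I have in mind - just a sketch using the DataStax Python driver, with made-up keyspace, table and column names, not a finished design:

```python
# Sketch only: assumes the DataStax Python driver (pip install cassandra-driver)
# and invented names for the metadata described above.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS archive
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")

# One row per document, partitioned by owner so "all documents for an owner"
# is a single-partition query; origin and date/time narrow it down further.
session.execute("""
    CREATE TABLE IF NOT EXISTS archive.documents (
        owner_id   bigint,
        origin     text,
        created_at timestamp,
        doc_id     uuid,
        keywords   set<text>,
        location   text,           -- pointer to the PDF/TIFF (or a blob column)
        PRIMARY KEY (owner_id, origin, created_at, doc_id)
    )
""")

# Typical lookup: all documents for one owner from one origin.
rows = session.execute(
    "SELECT doc_id, created_at, keywords, location "
    "FROM archive.documents WHERE owner_id = %s AND origin = %s",
    (42, "scanner-eu-1"),
)
for row in rows:
    print(row.doc_id, row.created_at, row.location)
```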
Thanks for your thoughts on this.
Upvotes: 1
Views: 1525
Reputation: 42617
Just a few thoughts:
You might want to also consider a distributed file system such as HDFS.
40 billion per year averages out to roughly 1,270 per second - Cassandra can handle that kind of write load, assuming the documents are modestly sized and not all huge multi-megabyte files.
What kind of read load are you anticipating?
Will the documents be preserved forever, i.e. 40 billion added per year indefinitely?
If a document is 100KB (say), that's 4 petabytes per year, I think? I've not heard of a Cassandra cluster that big - it would be worth asking on the Cassandra mailing list (with some realistic figures rather than my guesses!).
I've heard that a Cassandra node can typically manage 1TB under heavy load, maybe 10TB under light load. So that's at least a 400-node cluster for year one, possibly much more, especially if you want replication.
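As a rough back-of-the-envelope check on those numbers - every input here is a guess, in particular the 100KB average document size, the replication factor of 3 and the ~10TB-per-node capacity:

```python
# Back-of-the-envelope capacity estimate; all inputs are assumptions.
DOCS_PER_YEAR = 40e9
AVG_DOC_SIZE = 100 * 1024           # 100 KB per document (guess)
SECONDS_PER_YEAR = 365 * 24 * 3600
NODE_CAPACITY = 10e12               # ~10 TB per node under light load (guess)
REPLICATION_FACTOR = 3              # typical, but a guess for this project

writes_per_second = DOCS_PER_YEAR / SECONDS_PER_YEAR
raw_bytes_per_year = DOCS_PER_YEAR * AVG_DOC_SIZE
stored_bytes_per_year = raw_bytes_per_year * REPLICATION_FACTOR
nodes_year_one = stored_bytes_per_year / NODE_CAPACITY

print(f"writes/sec:       {writes_per_second:,.0f}")                  # ~1,270
print(f"raw data/year:    {raw_bytes_per_year / 1e15:.1f} PB")        # ~4.1 PB
print(f"stored data/year: {stored_bytes_per_year / 1e15:.1f} PB")     # ~12.3 PB
print(f"nodes, year one:  {nodes_year_one:,.0f}")                     # ~1,200
```

With replication factored in, the year-one cluster comes out well above the 400-node floor, which is why I'd take these figures to the mailing list before committing.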
This page gives some 2009 figures for HDFS capabilities - 14 petabytes (60 million files) using 4000 nodes, plus a lot of other interesting detail (e.g. name nodes needing 60GB of RAM).
Upvotes: 1