AAA

Reputation: 53

How much data can be saved to wide column rows?

I understand that for a NoSQL wide column database such as Cassandra, you can use UserID as the key for a row and save all information related to that user in one row. For example, you can have a column family called "Personal Info" and save address/phone/name etc. there. You can have another column family called "Work Info" and save the user's title, office, history etc. there. I would like to have another column family called "Project" and save a massive amount of project-related data to that column family. My question is how much project data I can save. Is 2 GB OK? Is 200 GB OK?
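
For concreteness, here is a rough CQL sketch of the kind of model I have in mind, with one table per "column family" described above (the keyspace, table and column names are just placeholders):

    -- Placeholder keyspace; SimpleStrategy with replication factor 3 is for illustration only
    CREATE KEYSPACE IF NOT EXISTS user_data
      WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

    -- "Personal Info": one row per user
    CREATE TABLE IF NOT EXISTS user_data.personal_info (
        user_id uuid PRIMARY KEY,
        name    text,
        address text,
        phone   text
    );

    -- "Work Info": one row per user
    CREATE TABLE IF NOT EXISTS user_data.work_info (
        user_id uuid PRIMARY KEY,
        title   text,
        office  text,
        history text
    );

    -- "Project": all of a user's project data clustered under a single partition key
    CREATE TABLE IF NOT EXISTS user_data.projects (
        user_id    uuid,
        project_id timeuuid,
        payload    blob,
        PRIMARY KEY ((user_id), project_id)
    );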

Upvotes: 1

Views: 291

Answers (1)

Mike

Reputation: 531

Long answer: In Cassandra there is no real limit to the size of a table. The size is limited only by the number of nodes and their capacity. Throughput requirements may dictate how much a single node can efficiently handle, but other than that you should easily be able to store tens of GB per node. I remember from the DataStax course that values in the range of 1-4 TB were mentioned as the usual maximum per node. That, however, will probably be achievable only under specific usage patterns, with well-thought-out schema modeling and experienced DBAs capable of fine-tuning everything.

Short answer: Tens of GB per node should be easy to achieve, and a few TB per node can be hoped for in favourable cases. As long as you can afford the nodes, the size is theoretically unlimited.

Side info:

  • You need to take the replication factor into account when calculating the size. Using the recommended replication factor of 3 means that for 1 GB of data the cluster will store 3 GB (so the 200 GB from your question would occupy roughly 600 GB across the cluster);
  • While the nodes can store a lot of data, how much you write, and especially how often you read, the latencies you expect and the consistency level you use will be the primary limit on the size per node. Data that you rarely read allows the nodes to simply store it without worrying about it much. If you read a lot, though, given that reads are far more expensive than writes, your nodes will start answering more and more slowly, so more nodes will be required for the same amount of data;
  • You should keep most partitions under 10 MB and all of them under 100 MB (see the bucketing sketch after this list);
  • You should keep the number of rows inside a partition under roughly 100k;
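
As mentioned in the partition-size point above, here is a hedged sketch of how the heavy "Project" data could be bucketed so that no single partition grows past those limits (the monthly bucket and the table/column names are just one possible choice):

    -- Illustrative only: split one user's project data across monthly buckets
    -- so that each (user_id, month) partition stays well under ~100 MB / ~100k rows
    CREATE TABLE IF NOT EXISTS user_data.projects_by_month (
        user_id    uuid,
        month      text,      -- e.g. '2016-01', part of the partition key
        project_id timeuuid,
        payload    blob,
        PRIMARY KEY ((user_id, month), project_id)
    );

    -- Reads then target one bounded partition at a time
    SELECT * FROM user_data.projects_by_month
     WHERE user_id = 123e4567-e89b-12d3-a456-426655440000 AND month = '2016-01';

Whether the bucket is a month, a project id or something else depends entirely on how fast the data grows; the point is only that the partition key must include something that caps partition growth.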

Overall, size will probably be the least of your problems. Storage is cheap; what will set you back is the computational requirements, and those will be determined by access patterns, throughput, the data model and node tuning.

Hope this helped, Cheers!

Upvotes: 2
