Reputation: 1431
I'm doing some data modelling for time series data in Cassandra, and I've decided to implement buckets to regulate my partition sizes and maintain reasonable distribution on my cluster.
I decided to bucketise such that my partitions would not exceed a size of 10MB, as I've seen numerous sources that state this as an ideal partition size, but I can't find any information on why 10MB was chosen. On top of this I can't find anything from DataStax or Apache that mentions this soft 10MB limit at all.
Our data can be requested for large periods of time, meaning lots of partitions will be needed to service a single request if the partition sizes remain at 10MB. I'd rather increase the size of the partitions and have fewer partitions required to service these requests.
Where does this idea of a 10MB partition size come from? Is it still relevant? What would be so bad if my partitions were 20MB in size? Or even 50MB?
With 10MB referenced in so many places, I feel like there must be something to it. Any information would be appreciated. Cheers.
Upvotes: 2
Views: 333
Reputation: 87069
I think much of this advice dates from older versions, when support for wide partitions wasn't very good: reading them put a lot of pressure on the heap, etc. Since Cassandra 3.0 the situation has improved considerably, but it's still recommended to keep the partition size on disk under 100 MB.
For example, the DataStax planning guide says in the section "Estimating partition size":
a good rule of thumb is to keep the maximum number of rows below 100,000 items and the disk size under 100 MB
In recent versions of Cassandra we can go beyond this recommendation, but it is still not advised, although it depends heavily on the access patterns. You can find more information in the following blog post, and this video.
I have seen users with 60+ GB partitions: the system still works, but the data distribution is not ideal, so nodes become "hot", and performance may suffer.
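To get a feel for what the rule of thumb quoted above means for bucket width, here is a rough back-of-the-envelope sketch in Python. It is only an illustration: the function names, the write rate, and the average row size are all hypothetical, and I'm treating 100 MB as decimal megabytes. The real on-disk size per row depends on your schema and compression, so measure it on your own tables (for example with `nodetool tablehistograms`) rather than trusting an estimate like this.

```python
import math

# Ceilings taken from the rule of thumb quoted above.
MAX_ROWS = 100_000
MAX_BYTES = 100 * 1000 * 1000  # 100 MB, assumed to mean decimal megabytes


def max_bucket_seconds(rows_per_second, avg_row_bytes,
                       max_rows=MAX_ROWS, max_bytes=MAX_BYTES):
    """Longest time bucket (in seconds) that stays under both ceilings.

    rows_per_second and avg_row_bytes are per-series estimates that you
    have to supply yourself (hypothetical inputs, not Cassandra settings).
    """
    by_rows = max_rows / rows_per_second
    by_size = max_bytes / (rows_per_second * avg_row_bytes)
    return math.floor(min(by_rows, by_size))


def partitions_for_query(query_seconds, bucket_seconds):
    """Upper bound on the buckets one time-range query touches per series."""
    # A range that is not aligned to bucket boundaries can spill into
    # one extra bucket, hence the +1.
    return math.ceil(query_seconds / bucket_seconds) + 1


if __name__ == "__main__":
    # Hypothetical series: 1 row per second, about 200 bytes per row on disk.
    bucket = max_bucket_seconds(rows_per_second=1, avg_row_bytes=200)
    print("max bucket width (s):", bucket)
    # A 30-day query against daily buckets.
    print("partitions touched:", partitions_for_query(30 * 86400, 86400))
```

With these made-up numbers the row-count ceiling, not the size ceiling, is what limits the bucket width, which is a useful reminder that both halves of the rule matter. The second function shows the trade-off from the question: wider buckets mean fewer partitions per request, at the cost of larger partitions.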
Upvotes: 4