Derek Troy-West

Reputation: 2469

Cassandra 2.0.2 CQL Long Row Limitation / Performance Impact

Given a simple CQL table which stores an ID and a Blob, is there any problem or performance impact of storing potentially billions of rows?
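
For concreteness, something like the following, where the table and column names are just placeholders:

    CREATE TABLE blobs (
        id text PRIMARY KEY,
        payload blob
    );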

I know with earlier versions of Cassandra wide rows were de rigueur, but CQL seems to encourage us to move away from that. I don't have any particular requirement to ensure the data is clustered together or to be able to filter it in any particular order. I'm wondering whether very many rows in a CQL table could be problematic in any way.

I'm considering binning my data, that is, creating a partition key which is hash(ID) % n, which would limit the data to n 'bins' (millions of them?). Before I add that overhead I'd like to validate whether it's actually worthwhile.
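
A rough sketch of what I have in mind (the bin count n and the client-side hashing are assumptions on my part):

    CREATE TABLE blobs_binned (
        bin int,         -- computed client-side as hash(id) % n
        id text,
        payload blob,
        PRIMARY KEY (bin, id)    -- bin is the partition key, id a clustering column
    );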

Upvotes: 0

Views: 240

Answers (2)

Derek Troy-West

Reputation: 2469

In a chat with Aaron Morton (The Last Pickle), he indicated that having billions of rows per table is not necessarily problematic.

Leaving this answer for reference, but not selecting it as "talked to a guy who knows a lot more than me" isn't particularly scientific.

Upvotes: 0

Alex Popescu

Reputation: 4002

First, I don't think this is correct:

I know with earlier versions of Cassandra wide rows were de rigueur, but CQL seems to encourage us to move away from that.

Wide rows are alive and well. There's a post from Jonathan Ellis, "Does CQL support dynamic columns / wide rows?":

A common misunderstanding is that CQL does not support dynamic columns or wide rows. On the contrary, CQL was designed to support everything you can do with the Thrift model, but make it easier and more accessible.
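
In CQL terms, a "wide row" is simply a partition with clustering columns. A minimal sketch (names are illustrative only):

    CREATE TABLE events_by_user (
        user_id text,
        event_time timestamp,
        payload blob,
        PRIMARY KEY (user_id, event_time)   -- all rows for a user_id live in one partition
    );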

For the part about the "performance impact of storing potentially billions of rows", I think the important thing to keep in mind is the size of those rows.

According to Aaron Morton in this mail thread:

When rows get above a few 10's of MB things can slow down, when they get above 50 MB they can be a pain, when they get above 100MB it's a warning sign. And when they get above 1GB, well, you don't want to know what happens then.

and later:

Larger rows take longer to go through compaction, tend to cause more JVM GC and have issues during repair. See the in_memory_compaction_limit_in_mb comments in the yaml file. During repair we detect differences in ranges of rows and stream them between the nodes. If you have wide rows and a single column is out of sync we will create a new copy of that row on the node, which must then be compacted. I've seen the load on nodes with very wide rows go down by 150GB just by reducing the compaction settings.

IMHO, all things being equal, rows in the few 10's of MB work better.

Upvotes: 1
