Reputation: 31
I have a data set in a Cassandra database where every record has to be processed once a month (basically a monthly subscription). The process runs every day, so the data is divided into 31 chunks, one processed each day. I'm trying to design a partition key that avoids filtering the whole data set.
The first solution would be to assign a partition key based on the day of the month. That gives a fixed number of partitions (31), one to process each day. The problem is that the data size will grow over time while the partition count stays the same, so I may hit performance issues because of overly wide rows.
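A minimal sketch of this first design, assuming each record gets a stable day-of-month bucket derived from its id (the function and id format are illustrative, not from the question):

```python
import hashlib

def day_bucket(record_id: str) -> int:
    """Map a record id to a stable day-of-month bucket in 1..31."""
    # Use a stable hash (not Python's randomized hash()) so the same
    # record always lands in the same daily bucket across runs.
    digest = hashlib.sha256(record_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 31 + 1

# The daily job then reads only the one partition matching today's
# slot, e.g. WHERE day = ? on a table whose partition key is `day`.
bucket = day_bucket("customer-42")
```

The upside is a single-partition read per day; the downside, as noted above, is that each of these 31 partitions grows without bound.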
The other solution would be not to deal with this problem at all and to process the whole table with Apache Spark every day (basically selecting 1/31 of the data using Spark filtering). As the data grows, the number of nodes in the cluster will also grow, so performance may stay constant. But all recommendations are against filtering data in Cassandra.
The maximum number of rows that could theoretically exist in this case is about 1 billion.
What would be the recommendations?
Upvotes: 2
Views: 468
Reputation: 13731
As you suspect, planning to have just 31 partitions is a really bad idea for performance. The primary problem is that the database cannot scale: with RF=3, at most 93 nodes (under unlikely optimal conditions) would hold any data, so you cannot scale to a bigger cluster. With Scylla (which further divides the data per core), you wouldn't be able to scale the cluster beyond 93 cores. The second problem is that Cassandra doesn't have very efficient indexing for reading from huge partitions, and reads become slower as a single partition grows huge.
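The scaling ceiling above is simple arithmetic: each partition is replicated RF times, and each replica can live on at most one distinct node, so only `partitions * rf` nodes can ever hold data:

```python
def max_useful_nodes(partitions: int, rf: int) -> int:
    """Upper bound on nodes that can hold any data: each partition
    contributes at most `rf` distinct replica placements."""
    return partitions * rf

print(max_useful_nodes(31, 3))  # → 93
```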
A compromise would be to use not just 31 partitions but 31*K partitions for some K. E.g., perhaps have a partition per hour rather than per day, or 100 partitions for each day. You'll need a way to consistently decide which record belongs to which of these partitions, but I guess you already have one (it currently assigns records to 31 partitions; all you need to change is to assign them to 31*K partitions). It just means that each day you'll need to scan K separate partitions instead of one, but this is trivial.
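One way to sketch the 31*K scheme (the names and the value of K are illustrative, not from the answer): keep the existing day assignment and add a hashed sub-bucket, so the partition key becomes the pair (day, sub_bucket):

```python
import hashlib

K = 100  # sub-partitions per day; tune so partitions stay reasonably sized

def partition_key(record_id: str, day: int) -> tuple[int, int]:
    """Return (day, sub_bucket): the record keeps its day in 1..31,
    and a stable hash spreads it over K sub-partitions."""
    digest = hashlib.sha256(record_id.encode("utf-8")).digest()
    sub_bucket = int.from_bytes(digest[:8], "big") % K
    return (day, sub_bucket)

def partitions_for_day(day: int) -> list[tuple[int, int]]:
    """The K partitions the daily job must scan instead of one."""
    return [(day, b) for b in range(K)]
```

The daily job then issues K single-partition queries (e.g. `WHERE day = ? AND sub_bucket = ?`), which parallelizes naturally and keeps each partition bounded.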
Finally, since 31 is a relatively small number, another option is to use 31 separate tables. This allows you to scan each table separately. I don't know which other queries you need to do, but if they don't need to cross table boundaries, splitting into 31 tables is a reasonable approach.
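If you go the 31-tables route, the daily job just targets one table by name. A sketch, assuming a hypothetical naming scheme like `subscriptions_day_01` … `subscriptions_day_31` (the names and query shape are illustrative):

```python
def table_for_day(day: int) -> str:
    """Name of the per-day table, e.g. day 7 -> 'subscriptions_day_07'."""
    if not 1 <= day <= 31:
        raise ValueError("day must be in 1..31")
    return f"subscriptions_day_{day:02d}"

def daily_scan_query(day: int) -> str:
    """Full scan of just today's table; no cross-partition filtering
    and no ALLOW FILTERING needed."""
    return f"SELECT * FROM {table_for_day(day)}"

print(daily_scan_query(7))  # → SELECT * FROM subscriptions_day_07
```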
Upvotes: 3