Reputation: 31
I have a data set in a Cassandra database where every record has to be processed once a month (basically a monthly subscription). The process runs every day, so the data is divided into 31 chunks, one processed each day. I'm trying to design a partition key that avoids filtering the whole data set.
The first solution would be to assign a partition key based on the day of the month. That gives a fixed number of partitions (31), one to process each day. The problem is that the data size will grow over time while the partition count stays the same, so I may hit performance issues because of overly wide rows.
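A minimal sketch of this first design, assuming each record gets a stable day-of-month bucket derived from its id (the function and id format are illustrative, not from the question):

```python
import hashlib

def day_bucket(record_id: str) -> int:
    """Map a record id to a stable day-of-month bucket in 1..31."""
    # Use a stable hash (not Python's randomized hash()) so the same
    # record always lands in the same daily bucket across runs.
    digest = hashlib.sha256(record_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 31 + 1

# The daily job then reads only the one partition matching today's
# slot, e.g. WHERE day = ? on a table whose partition key is `day`.
bucket = day_bucket("customer-42")
```

The upside is a single-partition read per day; the downside, as noted above, is that each of these 31 partitions grows without bound.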
The other solution would be not to deal with this problem at all and to process the whole table with Apache Spark every day (basically selecting 1/31 of the data using Spark filtering). As the data grows, the number of nodes in the cluster will also grow, so performance may stay constant. But all recommendations are against filtering data in Cassandra.
The maximum number of rows that could theoretically exist in this case is about 1 billion.
What would be the recommendations?
Upvotes: 2
Views: 468
Reputation: 13731
As you suspect, planning to have just 31 partitions is a really bad idea for performance. The primary problem is that the database cannot scale: with RF=3, at most 93 nodes (under unlikely optimal conditions) would hold any data, so you cannot scale to a bigger cluster. With Scylla (which further divides the data per core), you wouldn't be able to scale the cluster beyond 93 cores. The second problem is that Cassandra doesn't have very efficient indexing for reading from huge partitions, and reads become slower as a single partition grows huge.
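The scaling ceiling above is simple arithmetic: each partition is replicated RF times, and each replica can live on at most one distinct node, so only `partitions * rf` nodes can ever hold data:

```python
def max_useful_nodes(partitions: int, rf: int) -> int:
    """Upper bound on nodes that can hold any data: each partition
    contributes at most `rf` distinct replica placements."""
    return partitions * rf

print(max_useful_nodes(31, 3))  # → 93
```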
A compromise would be to use not just 31 partitions but 31*K partitions for some K. E.g., perhaps have a partition per hour rather than per day, or 100 partitions for each day. You'll need a way to consistently decide which record belongs to which of these partitions, but I guess you already have one (it currently assigns records to 31 partitions; all you need to change is to assign them to 31*K partitions). It just means that each day you'll need to scan K separate partitions instead of one, but this is trivial.
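One way to sketch the 31*K scheme (the names and the value of K are illustrative, not from the answer): keep the existing day assignment and add a hashed sub-bucket, so the partition key becomes the pair (day, sub_bucket):

```python
import hashlib

K = 100  # sub-partitions per day; tune so partitions stay reasonably sized

def partition_key(record_id: str, day: int) -> tuple[int, int]:
    """Return (day, sub_bucket): the record keeps its day in 1..31,
    and a stable hash spreads it over K sub-partitions."""
    digest = hashlib.sha256(record_id.encode("utf-8")).digest()
    sub_bucket = int.from_bytes(digest[:8], "big") % K
    return (day, sub_bucket)

def partitions_for_day(day: int) -> list[tuple[int, int]]:
    """The K partitions the daily job must scan instead of one."""
    return [(day, b) for b in range(K)]
```

The daily job then issues K single-partition queries (e.g. `WHERE day = ? AND sub_bucket = ?`), which parallelizes naturally and keeps each partition bounded.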
Finally, since 31 is a relatively small number, another option is to use 31 separate tables. This allows you to scan each table separately. I don't know which other queries you need to do, but if they don't need to cross table boundaries, splitting into 31 tables is a reasonable approach.
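If you go the 31-tables route, the daily job just targets one table by name. A sketch, assuming a hypothetical naming scheme like `subscriptions_day_01` … `subscriptions_day_31` (the names and query shape are illustrative):

```python
def table_for_day(day: int) -> str:
    """Name of the per-day table, e.g. day 7 -> 'subscriptions_day_07'."""
    if not 1 <= day <= 31:
        raise ValueError("day must be in 1..31")
    return f"subscriptions_day_{day:02d}"

def daily_scan_query(day: int) -> str:
    """Full scan of just today's table; no cross-partition filtering
    and no ALLOW FILTERING needed."""
    return f"SELECT * FROM {table_for_day(day)}"

print(daily_scan_query(7))  # → SELECT * FROM subscriptions_day_07
```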
Upvotes: 3