Wafa Saba

Reputation: 101

Cassandra - Handling partitions and buckets for large data sizes

We have a requirement where an application reads files and inserts data into a Cassandra database; the table can grow by 300+ MB in a single run during the day. The table will have the structure below:

create table if not exists orders (
    id uuid,
    record text,
    status varchar,
    create_date timestamp,
    modified_date timestamp,
    primary key (status, create_date)
);

The 'status' column can take one of the values [Started, Completed, Done]. According to a couple of documents on the internet, read performance is best when a partition stays under 100 MB, and a secondary index should only be placed on a column that is rarely modified (so I cannot index the 'status' column). Also, if I use buckets with TWCS at minute granularity, there will be a very large number of buckets, which may hurt performance.

So, how can I make better use of partitions and/or buckets so that inserts are spread evenly across partitions and records can still be read by the appropriate status?
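For illustration, the minute-level bucketing I was considering would look roughly like this (the orders_by_minute name and the bucket format are placeholders):

create table if not exists orders_by_minute (
    status varchar,
    bucket text,              -- create_date truncated to the minute, e.g. '2021-06-01T10:42' (illustrative)
    id uuid,
    record text,
    create_date timestamp,
    modified_date timestamp,
    primary key ((status, bucket), create_date, id)
) with compaction = {'class': 'TimeWindowCompactionStrategy',
                     'compaction_window_unit': 'MINUTES',
                     'compaction_window_size': 1};

My concern is that with only three status values and one bucket per minute, reading all records for a given status would have to fan out over a huge number of partitions.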

Thank you in advance.

Upvotes: 0

Views: 465

Answers (1)

Mike

Reputation: 531

From the discussion in the comments, it looks like you are trying to use Cassandra as a queue, and that is a big anti-pattern.
While you can store data about the operations you've done in Cassandra, you should look at something like Kafka or RabbitMQ for the queuing itself.

It could look something like this:

  1. Application 1 copies/generates record A;
  2. Application 1 adds the path of A to a queue;
  3. Application 1 upserts into Cassandra, in a partition based on the file id/path (the other columns can hold info such as the date, the time to copy, the file hash, etc.; see the sketch after this list);
  4. Application 2 reads the queue, finds A, processes it, and determines whether it failed or completed;
  5. Application 2 upserts information about the processing into Cassandra, including the status; you can also store things like the reason for the failure;
  6. If it is a failure, you can write the path/id to another topic.
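A minimal sketch of such a processing-log table could look like this (the table and column names are my own assumptions, not taken from your schema):

create table if not exists file_processing_log (
    file_path text,          -- unique file id/path, used as the partition key (assumed name)
    processed_at timestamp,
    status text,             -- e.g. 'Started', 'Completed', 'Failed'
    time_to_copy_ms bigint,
    file_hash text,
    failure_reason text,     -- populated only when the processing failed
    primary key (file_path, processed_at)
);

Because the partition key is the file id/path, writes spread evenly across partitions, and reading the history of a single file is a cheap single-partition query.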

So to sum it up: don't try to use Cassandra as a queue; that is a widely accepted anti-pattern. You can and should use Cassandra to persist a log of what you have done, including the results of the processing (if applicable), how files were processed, and so on.
Depending on how you will later need to read and use the data in Cassandra, you could think about partitions and buckets based on things like the source of the file, the type of file, etc., as in the sketch below. If not, you could keep it partitioned by a unique value like the UUID I've seen in your table and look records up by that.
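If you do need to read back by source and time, a bucketed variant could look like this (the names and the daily bucket granularity are only illustrative):

create table if not exists file_processing_by_source (
    source text,             -- where the file came from (illustrative)
    day_bucket date,         -- one partition per source per day keeps partition sizes bounded
    processed_at timestamp,
    file_path text,
    status text,
    primary key ((source, day_bucket), processed_at, file_path)
) with clustering order by (processed_at desc, file_path asc);

A daily bucket is just a starting point; you would size the bucket so that a single partition stays comfortably under the ~100 MB guideline you mentioned.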

Hope this helped,
Cheers!

Upvotes: 2
