Is it a bad practice to have a Cassandra table with partitions of a single row?

Question

Let's say I have a table like this

CREATE TABLE request(
  transaction_id text,
  request_date timestamp,
  data text, 
  PRIMARY KEY (transaction_id)
);

The transaction_id is unique, so as far as I understand each partition in this table would have one row only and I'm not sure if this situation causes a performance issue in the OS, maybe because Cassandra creates a file for each partition causing lots of files to manage for its hosting OS, as a note I'm not sure how Cassandra creates its files for its tables.

In this scenario I can find a request by its transaction_id like

select data from request where transaction_id = 'abc';

If the previous assumption is correct, a different approach could be the next one?

CREATE TABLE request( 
  the_date date, 
  transaction_id text, 
  request_date timestamp, 
  data text, 
  PRIMARY KEY ((the_date), transaction_id)
);

The field the_date would change every next day, so the partitions in the table would be created for each day.

In this scenario I would have to have the_date data always available to the client so I can find a request using the next query

select data from request where the_date = '2020-09-23' and transaction_id = 'abc';

Thank you in advance for your kind help!

Alex Ott · Accepted Answer

Cassandra doesn't create a separate file for each partition. One SSTable file may contain multiple partitions. Partitions that consist only of one row are often called "skinny rows" - they aren't very bad, but may cause some performance issues:

to access such partitions you still need to read a block with compressed data (by default it's 64Kb) that needs to be decompressed to read that data. If you're doing really random access, such blocks would be discarded from file cache and needs to be re-read from disk. In this case, it's maybe useful to decrease the block size
if you have a lot of such partitions per table per node - this may heavily increase the size of the bloom filter, because each partition has a separate entry in it. I saw some customers that had tens of gigabytes of memory allocated for bloom filter only because of the skinny partitions

so it's really depends on the amount of data, access patterns, etc. It could be good or bad, depends on that factors.

If you have date available, and want to use it as part partition key - that may also not advisable because if you're writing and reading a lot of data on that day, then only some nodes will handle that load - this is so-called "hot partitions".

You may implement so-called bucketing, when you infer partition key from the data. But this will depend on the data available. For example, if you have date + transaction ID as a string, you may create partition key as date + 1st character of that string - in this case you'll have N partition keys per day, that are distributed between nodes, eliminating the hot partition problem.

See the corresponding best practices doc from DataStax about that topic.

Is it a bad practice to have a Cassandra table with partitions of a single row?

Answers (2)

Related Questions