Carlos Avila
Carlos Avila

Reputation: 63

Is it a bad practice to have a Cassandra table with partitions of a single row?

Let's say I have a table like this

CREATE TABLE request(
  transaction_id text,
  request_date timestamp,
  data text, 
  PRIMARY KEY (transaction_id)
);

The transaction_id is unique, so as far as I understand each partition in this table would have one row only and I'm not sure if this situation causes a performance issue in the OS, maybe because Cassandra creates a file for each partition causing lots of files to manage for its hosting OS, as a note I'm not sure how Cassandra creates its files for its tables.

In this scenario I can find a request by its transaction_id like

select data from request where transaction_id = 'abc';

If the previous assumption is correct, a different approach could be the next one?

CREATE TABLE request( 
  the_date date, 
  transaction_id text, 
  request_date timestamp, 
  data text, 
  PRIMARY KEY ((the_date), transaction_id)
);

The field the_date would change every next day, so the partitions in the table would be created for each day.

In this scenario I would have to have the_date data always available to the client so I can find a request using the next query

select data from request where the_date = '2020-09-23' and transaction_id = 'abc';

Thank you in advance for your kind help!

Upvotes: 6

Views: 2257

Answers (2)

Alex Ott
Alex Ott

Reputation: 87069

Cassandra doesn't create a separate file for each partition. One SSTable file may contain multiple partitions. Partitions that consist only of one row are often called "skinny rows" - they aren't very bad, but may cause some performance issues:

  • to access such partitions you still need to read a block with compressed data (by default it's 64Kb) that needs to be decompressed to read that data. If you're doing really random access, such blocks would be discarded from file cache and needs to be re-read from disk. In this case, it's maybe useful to decrease the block size
  • if you have a lot of such partitions per table per node - this may heavily increase the size of the bloom filter, because each partition has a separate entry in it. I saw some customers that had tens of gigabytes of memory allocated for bloom filter only because of the skinny partitions

so it's really depends on the amount of data, access patterns, etc. It could be good or bad, depends on that factors.

If you have date available, and want to use it as part partition key - that may also not advisable because if you're writing and reading a lot of data on that day, then only some nodes will handle that load - this is so-called "hot partitions".

You may implement so-called bucketing, when you infer partition key from the data. But this will depend on the data available. For example, if you have date + transaction ID as a string, you may create partition key as date + 1st character of that string - in this case you'll have N partition keys per day, that are distributed between nodes, eliminating the hot partition problem.

See the corresponding best practices doc from DataStax about that topic.

Upvotes: 6

Philipp
Philipp

Reputation: 1298

Let me not get into the different types of keys, but let me mention and shortly explain the two keys you use in your question.

PRIMARY KEY

A row MUST have a unique primary key (which identifies the row as what it is regarding equality). The primary key can be a collection of columns (as in your second example with (the_date), transaction_id) or just a single column (as in your first example with transaction_id). Nevertheless, as mentioned the important part is that for a row the primary key must be unique to identify the row.

PARTITION KEY

The partition key is actually determined based on the primary key. You can have composite partition key (you used the syntax for that in your second example, to enforce the (the_date) to be the partition key, this is actually not necessary since it would be by default the first column of the primary key).

Cassandra uses the hashed value of the (combined) partition key(s') values to determine on which node(s) the data is stored (or retrieved from when requesting data).

So the answer to your question is, it's totally ok to use the transaction_id as primary and partition key. And that is not bad practice, it's more or less pretty common practice if you have a unique identifier in your data which can be stored in one row and fulfills your needs regarding requests.

More Infos:

Upvotes: 2

Related Questions