Reputation: 63
Let's say I have a table like this
CREATE TABLE request(
transaction_id text,
request_date timestamp,
data text,
PRIMARY KEY (transaction_id)
);
The transaction_id is unique, so as far as I understand each partition in this table would have one row only and I'm not sure if this situation causes a performance issue in the OS, maybe because Cassandra creates a file for each partition causing lots of files to manage for its hosting OS, as a note I'm not sure how Cassandra creates its files for its tables.
In this scenario I can find a request by its transaction_id like
select data from request where transaction_id = 'abc';
If the previous assumption is correct, a different approach could be the next one?
CREATE TABLE request(
the_date date,
transaction_id text,
request_date timestamp,
data text,
PRIMARY KEY ((the_date), transaction_id)
);
The field the_date would change every next day, so the partitions in the table would be created for each day.
In this scenario I would have to have the_date data always available to the client so I can find a request using the next query
select data from request where the_date = '2020-09-23' and transaction_id = 'abc';
Thank you in advance for your kind help!
Upvotes: 6
Views: 2257
Reputation: 87069
Cassandra doesn't create a separate file for each partition. One SSTable file may contain multiple partitions. Partitions that consist only of one row are often called "skinny rows" - they aren't very bad, but may cause some performance issues:
so it's really depends on the amount of data, access patterns, etc. It could be good or bad, depends on that factors.
If you have date available, and want to use it as part partition key - that may also not advisable because if you're writing and reading a lot of data on that day, then only some nodes will handle that load - this is so-called "hot partitions".
You may implement so-called bucketing, when you infer partition key from the data. But this will depend on the data available. For example, if you have date + transaction ID as a string, you may create partition key as date + 1st character of that string - in this case you'll have N partition keys per day, that are distributed between nodes, eliminating the hot partition problem.
See the corresponding best practices doc from DataStax about that topic.
Upvotes: 6
Reputation: 1298
Let me not get into the different types of keys, but let me mention and shortly explain the two keys you use in your question.
PRIMARY KEY
A row MUST have a unique primary key (which identifies the row as what it is regarding equality). The primary key can be a collection of columns (as in your second example with (the_date), transaction_id
) or just a single column (as in your first example with transaction_id
). Nevertheless, as mentioned the important part is that for a row the primary key must be unique to identify the row.
PARTITION KEY
The partition key is actually determined based on the primary key. You can have composite partition key (you used the syntax for that in your second example, to enforce the (the_date)
to be the partition key, this is actually not necessary since it would be by default the first column of the primary key).
Cassandra uses the hashed value of the (combined) partition key(s') values to determine on which node(s) the data is stored (or retrieved from when requesting data).
So the answer to your question is, it's totally ok to use the transaction_id
as primary and partition key. And that is not bad practice, it's more or less pretty common practice if you have a unique identifier in your data which can be stored in one row and fulfills your needs regarding requests.
More Infos:
Upvotes: 2