Reputation: 1537

Cassandra CLUSTERING ORDER with updates [performance]

With Cassandra it is possible to specify the clustering order of a table by a particular column.

CREATE TABLE myTable (
    user_id INT,
    message TEXT,
    modified DATE,
    PRIMARY KEY ((user_id), modified)
)
WITH CLUSTERING ORDER BY (modified DESC);

Note: In this example, there is one message per user_id (intended)

Given this table, my understanding is that queries will perform better when recent data is requested.

However, if one were to make updates to the "modified" column, does it add extra overhead on the server to "re-order" the data, and is that overhead significant compared to the gain in query performance?

In other words, given this table, would it perform better if the "CLUSTERING ORDER BY (modified DESC)" was dropped?

UPDATE: Fixed the invalid CQL by adding modified to the primary key; however, the original questions still stand.

Upvotes: 1

Answers (3)

harish

Reputation: 43

In your data model, user_id is the row key/shard key/partition key, which is important for data locality, and the clustering column (modified) specifies the order in which the data is arranged inside the partition. The combination of these two keys makes up the primary key.

Even in the RDBMS world, updating a primary key is avoided for the sake of data integrity.

In Cassandra, however, there are no constraints or relations between column families/tables. Writing with exactly the same values for the primary key fields (user_id, modified) will update the existing record; otherwise it will add a new record.

Reference: https://www.datastax.com/dev/blog/we-shall-have-order
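A minimal sketch of that upsert behavior, assuming the myTable definition from the question (the values are made up for illustration):

```cql
-- First write: creates the row (user_id = 1, modified = '2020-01-01').
INSERT INTO myTable (user_id, modified, message)
VALUES (1, '2020-01-01', 'first draft');

-- Same primary key values: overwrites message on the existing row.
INSERT INTO myTable (user_id, modified, message)
VALUES (1, '2020-01-01', 'edited');

-- Different modified value: a second row in the same partition.
INSERT INTO myTable (user_id, modified, message)
VALUES (1, '2020-01-02', 'newer message');
```

After these three statements the partition for user_id = 1 should hold two rows, not three.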

Upvotes: 0

Carlos Monroy Nieblas

Reputation: 2283

Moving my comment to an answer, in reply to the updated question:

if one were to make updates to the "modified" column does it add extra overhead on the server to "re-order" and is that overhead vs query performance significant?

If modified is defined as part of the clustering key, you won't be able to update its value on an existing record, but you will be able to add as many records as needed, each with a different modified date.

Cassandra is an append-only database engine: any update is written as a new version of the record with a newer write timestamp, and a select returns the version with the latest timestamp. Existing data is never rearranged in place, so there is no "re-order" operation.

Whether to keep or drop the clustering order should be decided based on how the data will be queried. If you are only going to read the latest records for a given user_id, it makes sense to keep the clustering order as you defined it.
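To make this concrete, here is a rough sketch against the table from the question (the values are hypothetical). An UPDATE that tries to set modified should be rejected, since it is a primary key column, so "changing" the date amounts to deleting the old row and inserting a new one:

```cql
-- Expected to be rejected as an invalid request:
-- primary key columns cannot appear in the SET clause.
UPDATE myTable SET modified = '2020-01-02'
WHERE user_id = 1 AND modified = '2020-01-01';

-- One way to "move" the row instead: delete the old clustering key
-- and insert the new one. Both statements hit the same partition,
-- so a batch applies them together.
BEGIN BATCH
  DELETE FROM myTable WHERE user_id = 1 AND modified = '2020-01-01';
  INSERT INTO myTable (user_id, modified, message)
  VALUES (1, '2020-01-02', 'hello');
APPLY BATCH;
```

Note that the DELETE writes a tombstone rather than removing data in place, which is consistent with the append-only behavior described above. The clustering direction (ASC or DESC) does not change how much work this takes.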

Upvotes: 1

BernadetteD

Reputation: 81

In order to make modified a clustering column, it needs to be defined in the primary key.

CREATE TABLE myTable (
    user_id INT,
    message TEXT,
    modified DATE,
    PRIMARY KEY ((user_id), modified)
)
WITH CLUSTERING ORDER BY (modified DESC);

This way, your data will be sorted primarily by the hashed value of the user_id, and within each user_id by modified. You don't need to drop the "WITH CLUSTERING ORDER BY (modified DESC)" clause.
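As an illustration of what the DESC order buys you, assuming the table above and a hypothetical user_id of 1, the most recent row can be read without an explicit ORDER BY:

```cql
-- With CLUSTERING ORDER BY (modified DESC), the newest row is stored
-- first in the partition, so LIMIT 1 returns the latest message.
SELECT message, modified FROM myTable WHERE user_id = 1 LIMIT 1;

-- Under the default ascending order, the same result would need
-- the order reversed at read time:
SELECT message, modified FROM myTable
WHERE user_id = 1 ORDER BY modified DESC LIMIT 1;
```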

Upvotes: 1
