How do range queries work for clustering keys in Cassandra?

Question

According to the official doc: Clustering columns order data within a partition. When a table has multiple clustering columns the data is stored in nested sort order.

Suppose we have simple timeseries table:

CREATE TABLE alerts_by_year(
  year int,
  ts timestamp,
  alert text,
  PRIMARY KEY ((year), ts)
);

A simple query that get events for some range:

SELECT * FROM alerts_by_year
  WHERE year=2022
  AND ts >'2022-06-24 03:11:00'
  AND ts <'2022-06-24 04:11:00'

What is algorithm complexity to find this range through the "ts" clustering keys? Is it constant time or O(n) time? Does it depends on the type of storage used: memtable or sstable?

How does it work then? Are we simply iterating through "ts" clustering keys until we find the required range?

Erick Ramirez · Accepted Answer

Clustering columns are stored in sorted order. If you don't explicitly specify the order when you create a table, the clustering columns will be sorted in ascending order.

In your case, the following table option is automatically added to your table's definition:

WITH CLUSTERING ORDER BY (ts ASC)

In your case, the table schema looks like:

CREATE TABLE alerts_by_year(
  year int,
  ts timestamp,
  alert text,
  PRIMARY KEY ((year), ts)
) WITH CLUSTERING ORDER BY (ts ASC)

Since the rows in each partition is sorted in chronological order from oldest to newest timestamp, a range query on the ts column is done sequentially, iterating one row at a time until the condition is satisfied.

Note that the drivers will automatically page through the results. For example, the Java driver will return the first 5000 rows by default. You app will then need to retrieve the "next page" to get the next set of rows. Cheers!

How do range queries work for clustering keys in Cassandra?

Answers (1)

Related Questions