Why can't I add WHERE clauses in Cassandra after filtering on the primary key?

Question

EDIT: Thanks for the code formatting, kind stranger. I will keep it in mind for the future!

I am following the basic planetcassandra.org Cassandra tutorial and I do not understand why I can't execute the following query:

select * 
from users 
where lastname = 'Smith' AND city = 'X';

on this table:

CREATE TABLE users 
(
    firstname text,
    lastname text,
    age int,
    email text,
    city text,
    PRIMARY KEY (lastname)
);

From my understanding, the partition key (lastname) partitions the data. So all rows with lastname Smith should be on node X. What is preventing me from filtering these rows even further by the city?

Thanks!

nickmbailey · Accepted Answer

There are two answers to your question: one specific to your example, and a more general one (which is probably what you are really after).

Answer for your example

In your specific example, the primary key is the single column "lastname". So in this case there is only a single row per partition. Any time you write to the row with the last name "Smith", you are overwriting any previous data in that row. In that case, an additional WHERE clause doesn't really make sense, because when you query for the "Smith" row there will only ever be one result.
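To see the overwrite in action, here is a small sketch against the users table from the question. The names and cities are made-up sample values, and it assumes the two inserts run in order:

```sql
-- Uses the users table from the question, where lastname is the
-- entire primary key (one row per partition).
INSERT INTO users (lastname, firstname, city) VALUES ('Smith', 'John', 'X');
INSERT INTO users (lastname, firstname, city) VALUES ('Smith', 'Jane', 'Y');

-- An INSERT with an existing primary key is an upsert, so the second
-- statement replaces the values written by the first. This returns a
-- single row (firstname 'Jane', city 'Y').
SELECT * FROM users WHERE lastname = 'Smith';
```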

More general answer

I'm guessing you meant your example to allow for more than one row per partition. Perhaps something like PRIMARY KEY (lastname, user_id) (or any column in the clustering key that would let you identify distinct users with the same last name).
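A minimal sketch of such a table might look like the following. The table name users_by_lastname and the uuid type for user_id are my assumptions, not something from the tutorial:

```sql
-- lastname is the partition key, user_id is the clustering column.
-- Several users can now share a last name within one partition.
CREATE TABLE users_by_lastname (
    lastname text,
    user_id uuid,
    firstname text,
    age int,
    email text,
    city text,
    PRIMARY KEY (lastname, user_id)
);

-- Returns every user named Smith, ordered by user_id.
SELECT * FROM users_by_lastname WHERE lastname = 'Smith';
```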

Partitions can be quite large in Cassandra, potentially millions of rows in a single partition. The clustering columns in your primary key determine how those rows are ordered when stored on disk. So when you query on a clustering column, Cassandra can use its knowledge of that ordering to find precisely the data you are looking for.

If Cassandra were to allow querying on columns that are not in the clustering key, it would require scanning all data within the partition and checking each row against your query. This would be extremely inefficient.
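For what it's worth, CQL does have an explicit opt-in for this kind of scan. The sketch below uses the query from the question and assumes a reasonably recent Cassandra version; as far as I know, older releases reject the second statement as well unless city has a secondary index:

```sql
-- Rejected by default: city is neither a key column nor indexed, so
-- Cassandra would have to check the matching rows one by one.
SELECT * FROM users WHERE lastname = 'Smith' AND city = 'X';

-- ALLOW FILTERING tells Cassandra you accept the scan-and-filter cost.
-- Fine for small partitions, but expensive on large ones.
SELECT * FROM users WHERE lastname = 'Smith' AND city = 'X' ALLOW FILTERING;
```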

To expand on clustering columns even more, the order of your clustering columns is important as well, because it determines the way rows are stored on disk, as mentioned above. So "PRIMARY KEY (a, b, c)" and "PRIMARY KEY (a, c, b)" are not the same. In the first example, rows within a partition (identified by "a") are ordered on disk first by the "b" column, and then all rows with the same value for "b" are ordered by the "c" column. This means that you could not query within the partition for rows with a particular value for "c" without also specifying "b". That query would again require scanning the entire partition, since rows are first ordered by "b".
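As a rough illustration, here is which queries line up with the on-disk order for "PRIMARY KEY (a, b, c)". The table name abc_example and the payload column are hypothetical, chosen only for this sketch:

```sql
CREATE TABLE abc_example (
    a int,
    b int,
    c int,
    payload text,
    PRIMARY KEY (a, b, c)
);

-- These follow the on-disk order (a, then b, then c) and are accepted:
SELECT * FROM abc_example WHERE a = 1;
SELECT * FROM abc_example WHERE a = 1 AND b = 2;
SELECT * FROM abc_example WHERE a = 1 AND b = 2 AND c = 3;
SELECT * FROM abc_example WHERE a = 1 AND b > 2;

-- This skips b, so the matching rows are scattered across the
-- partition. Cassandra rejects it by default.
SELECT * FROM abc_example WHERE a = 1 AND c = 3;
```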

Knowing the exact queries you want to do up front will help you determine the clustering key you need and whether or not you need to denormalize into multiple tables to support multiple queries.
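For example, if you know you need to look users up by city and last name, one possible denormalized table is sketched below. The table name users_by_city and the user_id column are assumptions for illustration, and your application would have to write each user to this table in addition to any other table it maintains:

```sql
-- city is the partition key; lastname and user_id are clustering
-- columns, so rows within a city are ordered by lastname first.
CREATE TABLE users_by_city (
    city text,
    lastname text,
    user_id uuid,
    firstname text,
    age int,
    email text,
    PRIMARY KEY (city, lastname, user_id)
);

-- Both restrictions are on key columns in order, so no scan is needed.
SELECT * FROM users_by_city WHERE city = 'X' AND lastname = 'Smith';
```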
