Reputation: 263
EDIT* Thanks for the code formatting kind stranger, i will keep in mind for future!
I am follow the basic planetcassandra.org Cassandra tutorial and I do not understand why I can't execute the following query:
select *
from users
where lastname = 'Smith' AND city = 'X';
on this table:
CREATE TABLE users
(
firstname text,
lastname text,
age int,
email text,
city text,
PRIMARY KEY (lastname)
);
From my understanding, the partition key (lastname) partitions the data. So all rows with lastname Smith should be on node X. What is preventing me from filtering these rows even further by the city?
Thanks!
Upvotes: 3
Views: 3487
Reputation: 7305
You would have to have a clustering column -- city.
Update: Apologies for the terse response. Let me flesh this out a bit.
Cassandra is built from the ground up as a distributed system engineered for high performance and availability. Although based off of SQL, CQL is limited in the kinds of queries you can and cannot do, and often you have to design your data model around your query pattern (and duplicate data) around your work load / access patterns.
It is true that once you specify a the partition key in your cql where clause, cassandra knows what node your data is stored in. However, it still has to find the data within that node.
Remember that C* stores data sequentially based on clustering columns. In order to find the CQL row you are looking for, cassandra would have to do full seeks on disk which are slow once you sale out and have a lot of data. If you have clustering columns x, y, and z the data is sorted by the three clustering columns respectively. This is why you can only include where constraints for x, y, and z sequentially.
Check out this data modeling tool to visualize data models at the c* storage layer, to see possible queries, and generate stress-yamls.
Upvotes: 3
Reputation: 3684
There are two answers to your question here. One specific to your example, and a more general answer (which is probably what you are really after).
Answer for your example
In your specific example, you have a single primary key "lastname". So in this case there is only a single row per partition. Any time you update the row with the last name "Smith" you are overwriting any previous data in that row. In that case, a where clause doesn't really make sense because when you query for the "Smith" row there will only ever be one result.
More general answer
I'm guessing you meant your example to allow for more than one row per partition. Perhaps something like PRIMARY KEY (lastname, user_id) (or any column in the clustering key that would let you identify distinct users with the same last name).
Partitions can be quite large in Cassandra. Potentially millions of rows in a single partition. The clustering columns in your primary key are what determine how those rows are are ordered when stored on disk. So when you do a query on the clustering column, Cassandra can use that knowledge of the ordering of data to precisely find the data you are looking for.
If Cassandra were to allow querying on columns that are not in the clustering key, it would require scanning all data within the partition and checking each row against your query. This is would be extremely inefficient.
To expand on clustering columns even more, the actual order of your clustering columns is important as well. The ordering determines the way rows are stored on disk as mentioned above. So "PRIMARY KEY (a, b, c)" and "PRIMARY KEY (a, c, b)" are not the same. In the first example, rows are ordered on disk first by the "b" column and then all rows with the same value for the "b" column are ordered by the "c" column. This means that you could not query within the partition for columns with a particular value for "c" without also specifying "b". That query would again require scanning the entire partition since rows are first ordered by "b".
Knowing the exact queries you want to do up front will help you determine the clustering key you need and whether or not you need to denormalize into multiple tables to support multiple queries.
Upvotes: 6