Cassandra partition keys organisation

Question

I am trying to store the following structure in Cassandra.

ShopID, UserID, FirstName, LastName, etc.

Most of the queries on it are

select * from table where ShopID = ? and UserID = ?

That's why it is useful to set (ShopID, UserID) as the primary key.

According to the documentation, Cassandra uses the first column of the primary key as the partition key by default; in my case that is ShopID. But I want to distribute the data uniformly across the Cassandra cluster. I cannot allow all the data for one ShopID to be stored in a single partition, because some shops have 10M records and others only 1k.

I can set (ShopID, UserID) as the partition key, which gives a uniform distribution of records across the Cassandra cluster. But then I can no longer retrieve all users that belong to a given ShopID:

select * 
from table 
where ShopID = ?

Obviously this query requires a full scan of the whole cluster, and I have no way to run it. This looks like a very hard constraint.

My question is how to reorganize the data to solve both problems (uniform data partitioning and the ability to run full-scan queries) at the same time.

Mikita Harbacheuski · Accepted Answer

In general, you need to make the user ID a clustering column and add some artificial information to your table and partition key when saving. This breaks a large natural partition into multiple synthetic ones. But when reading, you now need to query all the synthetic partitions and combine them back into the natural partition. So the goal is to find a reasonable trade-off between the number (and size) of synthetic partitions and the number of read queries needed to combine them.
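As a rough illustration of the idea above, here is a minimal Python sketch that derives a synthetic bucket from the user ID and builds the CQL for both query shapes. The table name `users_by_shop`, the column types, the bucket count of 16, and the choice of CRC32 as the hash are all assumptions made for this example, not something taken from the question; the statements are plain strings, so no driver is required to follow it.

```python
import zlib

# Assumed value: more buckets means smaller partitions but more read queries per shop.
NUM_BUCKETS = 16

# Hypothetical table: (shop_id, bucket) is the partition key, user_id is a clustering column.
CREATE_TABLE = """
CREATE TABLE users_by_shop (
    shop_id text,
    bucket int,
    user_id text,
    first_name text,
    last_name text,
    PRIMARY KEY ((shop_id, bucket), user_id)
)
"""

SELECT_USER = "SELECT * FROM users_by_shop WHERE shop_id = ? AND bucket = ? AND user_id = ?"
SELECT_BUCKET = "SELECT * FROM users_by_shop WHERE shop_id = ? AND bucket = ?"


def bucket_for(user_id, num_buckets=NUM_BUCKETS):
    """Map a user ID to a stable bucket number in [0, num_buckets)."""
    # CRC32 gives the same value in every process, unlike Python's built-in hash() for str.
    return zlib.crc32(user_id.encode("utf-8")) % num_buckets


def point_lookup(shop_id, user_id):
    """Statement and parameters for the common (ShopID, UserID) query: one partition."""
    return SELECT_USER, (shop_id, bucket_for(user_id), user_id)


def shop_scan(shop_id, num_buckets=NUM_BUCKETS):
    """One statement per synthetic partition; the client merges the results."""
    return [(SELECT_BUCKET, (shop_id, b)) for b in range(num_buckets)]
```

Because the bucket is recomputed from the user ID, the point lookup still touches a single partition, while listing a whole shop costs `NUM_BUCKETS` queries. The same bucket function has to be used on writes, and changing the bucket count later means rewriting existing rows.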

A comprehensive description of possible implementations can be found here and here (Example 2: User Groups).

Also take a look at the solution in Example 3: User Groups by Join Date, where querying, ordering, and grouping are performed on a clustering column of date type. It can be useful if you have similar queries.
