Reputation: 3210
I am trying to store the following structure in cassandra.
ShopID, UserID , FirstName , LastName etc....
The most of the queries on it are
select * from table where ShopID = ? , UserID = ?
That's why it is useful to set (ShopID, UserID
) as the primary key.
According to docu the default partitioning key by Cassandra is the first column of primary key - for my case it's ShopID
, but I want to distribute the data uniformly on Cassandra cluster, I can not allow that all data from one shopID
are stored only in one partition, because some of shops have 10M records and some only 1k.
I can setup (ShopID, UserID
) as partitioning keys then I can reach the uniform distribution of records in the Cassandra cluster . But after that I can not receive all users that belong to some shopid
.
select *
from table
where ShopID = ?
Its obvious that this query demand full scan on the whole cluster but I have no any possibility to do it. And it looks like very hard constraint.
My question is how to reorganize the data to solve both problem (uniform data partitioning, possibility to make full scan queries) in the same time.
Upvotes: 1
Views: 644
Reputation: 2253
Each node in Cassandra is responsible for some token ranges. Cassandra derives a token from row's partition key using hashing and sends the record to node whose token range includes this token. Different records can have the same token and they are grouped in partitions. For simplicity we can assume that each cassandra nodes stores the same number of partitions. And we also want that partitions will be equal in size for uniformly distribution between nodes. If we have a too huge partition that means that one of our nodes needs more resources to process it. But if we break it in multiple smaller we increase the chance that they will be evenly distirbuted between all nodes.
However distribution of token ranges between nodes doesn't related with distribution of records between partitions. When we add a new node it just assumes responsibility for even portion of token ranges from other nodes and as the result the even number of partitions. If we had 2 nodes with 3 GB of data, after adding a third node each node stores 2 GB of data. That's why scalability isn't affected by partitioning and you don't need to change your historical data after adding a new node.
Upvotes: 1
Reputation: 2253
In general you need to make user id a clustering column and add some artificial information to your table and partition key during saving. It allows to break a large natural partition to multiple synthetic. But now you need to query all synthetic partitions during reading to combine back natural partition. So the goal is find a reasonable trade-off between number(size) of synthetic partitions and read queries to combine all of them.
Comprehensive description of possible implementations can be found here and here (Example 2: User Groups).
Also take a look at solution (Example 3: User Groups by Join Date) when querying/ordering/grouping is performed by clustering column of date type. It can be useful if you also have similar queries.
Upvotes: 3