Cassandra schema use set collection or multiple rows

Question

I'm designing a keyspace in Cassandra that will hold information about groups of users. Some info on it:

Access to this data will only be made by requesting what users are contained in a certain group and updating the users that are contained in a group.
Reads will be much more frequent than writes.
Each group could contain up to 20,000 user IDs

I have two designs that I'm considering for this.

Multiple rows per group: The table would have two columns of type TEXT and be keyed on Primary Key (GroupID, UserID) and reading the users in a group would be done by select * from table where GroupID = {GroupID} and would return as many rows as there are users in the group.
One row per group using the Cassandra Set Collection: The table would have two columns, the first (GroupID) of type TEXT and the second (UserIDs) of type SET[TEXT] and be keyed on Pimary Key (GroupID). Reading the users in a gorup would be done by select * from table where GroupID = {GroupID} and would return a single row with the set of user ids contained in its UserIDs column set.

I can't find a lot of documentation surrounding what would be the better design for this scenario. Any thoughts or pros and cons to either scenario?

Jeff Jirsa · Accepted Answer

For a group of 20k user IDs, I would absolutely avoid using collections at all costs. Collections are a convenience feature, but they're not nearly as performant as using a traditional CQL data model where you have the PRIMARY KEY(GroupID,UserID) where all users are ordered in a single partition. That will be both easy to reason about, easy to query (can SELECT either a single partition and page through all group members, or you can SELECT ... WHERE GroupID=X and UserID=Y to determine if a user is in the group), and very performant.

Cassandra schema use set collection or multiple rows

Answers (2)

Related Questions