Wicia
Wicia

Reputation: 565

Cassandra - search by clustered key

This is my diseases table definition:

id text,
drugid text,
name
PRIMARY KEY (drugid, id)

Now I want to perform search by drugid column only (all values in this column are unique). This primary key was created due to quick drug search.

Now - what will be best solution to filter this table using id? Creating new table? Pass additional value (drugid) to SELECT? Is it option with only id?

Thans for help :)

Upvotes: 0

Views: 179

Answers (1)

Horia
Horia

Reputation: 2982

Looking at your table definition, the partition key is drugid. This means that your queries will have to include the drugid. But since id is also part of the primary key, you could do something like:

select * from diseases where drugid = ? and id = ?

Unfortunately just having the id is not possible, unless you create a secondary index on it. Which wouldn't be very good since you could trigger a full cluster scan.

So, the solutions are:

  • specify the partition key (if possible), in this case drugid
  • create a new table that will have the id as partition key; in this case you will need to maintain both tables;

I guess the solution you'll choose depends on your data set. You should test to see how each solution behaves.

Should you use a secondary index?

When specifying the partition key, Cassandra will read the exact data from the partition and from only one node.

When you create a secondary index, Cassandra needs to read the data from partitions spread across the whole cluster. There are performance impact implications when an index is built over a column with lots of distinct values. Here is some more reading on this matter - Cassandra at Scale: The Problem with Secondary Indexes

In the above article, there is an interesting comment by @doanduyhai:

"There is only 1 case where secondary index can perform very well and NOT suffer from scalability issue: when used in conjunction with PARTITION KEY. If you ensure that all of your queries using secondary index will be of the form :

SELECT ... FROM ... WHERE partitionKey=xxx AND my_secondary_index=yyy

then you're safe to go. Better, in this case you can mix in many secondary indices. Performance-wise, since all the index reading will be local to a node, it should be fine"

I would stay away from secondary indexes.

From what you described, id will have distinct values, more or less, so you might run into performance issues since "a general rule of thumb is to index a column with low cardinality of few values".

Also, if id is a clustering column, the data will be stored in an ordered manner. The clustering column(s) determine the data’s on-disk sort order only within a partition key. The default order is ASC.

I would suggest some more reading - When not to use an index and Using a secondary index

Upvotes: 2

Related Questions