Islam Hassan

Reputation: 712

Are secondary indices always a bad idea in Cassandra even if I specify them in conjunction with the partitioning key in all my queries?

I know that secondary indices in Cassandra are generally a bad idea because the index is stored locally on each node, i.e. it is not distributed across the cluster, so a query may end up scanning a huge number of nodes. However, I don't understand why they are still a bad idea if I always specify the partition key in my queries and only use the secondary index as a final filter. I've read that they don't scale with large amounts of data even if I specify the partition key. Is this true? And if it is, why?

Upvotes: 4

Views: 1355

Answers (3)

Saifallah KETBI

Reputation: 303

In general, secondary indexes are a bad idea, not only because of the distributed part but also because of the index size and the number of distinct values: if the indexed field has very high or very low cardinality, you will spend time scanning many rows or many columns. You can also run into other issues when dealing with tombstones.

To answer your question: secondary indexes in Cassandra don't scale that well, but if you supply the partition key, and so tell Cassandra which node holds the data, they perform much better. You can find more details in section F of this post:

https://www.datastax.com/blog/2016/04/cassandra-native-secondary-index-deep-dive
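To make the difference concrete, here is a minimal Python sketch of a single node's local index. It is a toy model, not Cassandra's actual storage engine: the `LocalIndex` class, the nested-map layout, and the `status` values are all assumptions made for illustration. It only shows that restricting by partition key shrinks the number of index entries the node has to examine.

```python
from collections import defaultdict


class LocalIndex:
    """Toy node-local index: indexed value -> partition key -> row ids.

    Hypothetical model for illustration only; not Cassandra's real layout.
    """

    def __init__(self):
        self.entries = defaultdict(lambda: defaultdict(list))

    def add(self, value, partition_key, row_id):
        self.entries[value][partition_key].append(row_id)

    def lookup(self, value, partition_key=None):
        """Return (matching row ids, number of index entries examined)."""
        by_partition = self.entries.get(value, {})
        if partition_key is not None:
            # Partition key supplied: only that partition's entries are read.
            rows = list(by_partition.get(partition_key, []))
            return rows, len(rows)
        # No partition key: every entry for this value is read.
        rows = [r for part in by_partition.values() for r in part]
        return rows, len(rows)


# 1000 partitions with 10 rows each; a low-cardinality "status" column.
index = LocalIndex()
for pk in range(1000):
    for row in range(10):
        status = "active" if row % 2 == 0 else "inactive"
        index.add(status, pk, (pk, row))

rows, examined = index.lookup("active", partition_key=7)
rows_all, examined_all = index.lookup("active")
print(examined, examined_all)
```

In this model the partition-restricted lookup examines 5 entries while the unrestricted one examines 5000, which is the low-cardinality problem in miniature.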

I hope this helps !

Upvotes: 5

LetsNoSQL

Reputation: 1538

Consider Cassandra on a ring of five machines, with a primary index of user IDs and a secondary index of user emails. If you query for a user by their ID (the primary indexed key), any machine in the ring knows which machine holds that user's record. One query, one read from disk. However, to query for a user by their email (the secondary indexed value), each machine has to query its own record of users. One query, five reads from disk. As you scale either the number of users system-wide or the number of machines in the ring, the noise-to-signal ratio increases and the overall efficiency of reads drops, in some cases to the point of timing out. Please refer to the link below for a good explanation of secondary indexes: https://dzone.com/articles/cassandra-scale-problem
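The five-machine example can be sketched as a small Python simulation. This is a toy model under stated assumptions: the `Node` and `Ring` classes are hypothetical, the modulo in `owner` is a stand-in for Cassandra's token-based partitioner, and replication is ignored. It only counts how many nodes each kind of query has to read.

```python
from collections import defaultdict


class Node:
    """One machine in the toy ring, with its own rows and a local email index."""

    def __init__(self):
        self.rows = {}                        # user_id -> email
        self.email_index = defaultdict(set)   # email -> user_ids on this node

    def insert(self, user_id, email):
        self.rows[user_id] = email
        self.email_index[email].add(user_id)


class Ring:
    def __init__(self, num_nodes):
        self.nodes = [Node() for _ in range(num_nodes)]

    def owner(self, user_id):
        # Assumption: simple modulo placement instead of a real token hash.
        return self.nodes[user_id % len(self.nodes)]

    def insert(self, user_id, email):
        self.owner(user_id).insert(user_id, email)

    def get_by_id(self, user_id):
        """Primary-key lookup: the owner is known, so one node is read."""
        return self.owner(user_id).rows.get(user_id), 1

    def get_by_email(self, email):
        """Index-only lookup: every node must consult its local index."""
        hits, nodes_read = [], 0
        for node in self.nodes:
            nodes_read += 1
            hits.extend(sorted(node.email_index.get(email, ())))
        return hits, nodes_read


ring = Ring(num_nodes=5)
for uid in range(100):
    ring.insert(uid, f"user{uid}@example.com")

email, reads_by_id = ring.get_by_id(42)
ids, reads_by_email = ring.get_by_email("user42@example.com")
print(reads_by_id, reads_by_email)
```

Growing `num_nodes` leaves `reads_by_id` at 1 while `reads_by_email` grows with the ring, which is the scaling problem described above.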

Upvotes: 2

Ram Pratap

Reputation: 520

These guys have a nice write-up on the performance impacts of secondary indexes: 

https://pantheon.io/blog/cassandra-scale-problem-secondary-indexes

The main impact (from the post) is that secondary indexes are local to each node, so to satisfy a query by indexed value, each node has to query its own records to build the final result set (as opposed to a primary key query, where it is known exactly which node needs to be queried). So the impact is not just on writes but on read performance as well.

Upvotes: 1
