Secondary index in Apache Cassandra

Question

I tried to understand the secondary Index in Cassandra using the following link:

https://www.youtube.com/watch?v=BPvZIj5fvl4

Let's say we have a 5-node cluster (N1, N2, N3, N4 and N5) with a replication factor of 3, which means each partition's data is replicated to 3 nodes in the cluster (say N1, N2 and N3).

Now when I execute this query:

SELECT *
FROM user
WHERE partitionKey = 'somedata' AND ClusteringKey = 'test';

with the read consistency level set to TWO, it will read from any two of the replica nodes N1, N2 and N3.

If I create a secondary index on one of the columns, on how many nodes will the following query be executed?

SELECT *
FROM user
WHERE partitionKey = 'somedata' AND secondaryKey = 'test';

I have two questions about this:

According to the video, the above query on a secondary index will read from all 5 nodes in the cluster to search on the indexed column. Is that correct?
Is there any other performance impact of using a secondary index? It would be great if the reason could be explained.

Pedro Gordo · Accepted Answer

Cassandra will contact nodes until it has found enough rows satisfying your query to reach the LIMIT, or until it has contacted all nodes. It does this in rounds: one node in the first round, two nodes in the second, four nodes in the third, and so on, starting with the node that holds the first token.

You can check the complete algorithm in this article (section E): https://www.datastax.com/dev/blog/cassandra-native-secondary-index-deep-dive
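The round-based pattern described above (1 node, then 2, then 4, ...) can be sketched as a small simulation. This is only a toy model of that description, not Cassandra's actual code: the function name `simulate_index_scan` and the per-node row counts are hypothetical, and the real algorithm in the linked article has more detail than this doubling rule.

```python
def simulate_index_scan(rows_per_node, limit):
    """Toy model of the doubling rounds described in the answer.

    rows_per_node: number of matching rows on each node, in token order
                   (hypothetical input, made up for illustration).
    limit: the LIMIT of the query.
    Returns (rounds, nodes_contacted, rows_returned).
    """
    total_nodes = len(rows_per_node)
    contacted = 0   # nodes queried so far
    found = 0       # matching rows collected so far
    rounds = 0
    batch = 1       # the first round contacts a single node

    # Stop once LIMIT is satisfied or every node has been contacted.
    while contacted < total_nodes and found < limit:
        rounds += 1
        end = min(contacted + batch, total_nodes)
        found += sum(rows_per_node[contacted:end])
        contacted = end
        batch *= 2  # assumed doubling: 1, 2, 4, ... nodes per round

    return rounds, contacted, min(found, limit)


if __name__ == "__main__":
    # 5-node cluster where only the third node holds a matching row.
    print(simulate_index_scan([0, 0, 1, 0, 0], limit=1))
    # No matches anywhere: every node ends up being contacted.
    print(simulate_index_scan([0, 0, 0, 0, 0], limit=10))
```

In this model, a query whose matches sit on the first node finishes in one round, while a query with no matches (or a very large LIMIT) ends up touching all 5 nodes.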

One thing to look out for when using secondary indexes is high cardinality in the indexed column: this creates massive indexes and hence uses a lot of disk space. Avoid creating secondary indexes on such columns.
