Coder

Reputation: 3262

Secondary index in Apache Cassandra

I tried to understand secondary indexes in Cassandra using the following link:

Let's say we have a 5-node cluster (N1, N2, N3, N4 and N5) with a replication factor of 3, which means each partition's data is replicated to 3 nodes in the cluster (say N1, N2 and N3).

Now when I execute this query:

SELECT *
FROM user
WHERE partitionKey = 'somedata' AND ClusteringKey = 'test';

with a read consistency level of 2,

it will query any two of the nodes N1, N2 and N3.

If I create a secondary index on one of the columns, on how many nodes will the following query be executed?

SELECT *
FROM user
WHERE partitionKey = 'somedata' AND secondaryKey = 'test';

I have two questions about this:

  1. As per the video, the above query on the secondary index will read from all 5 nodes in the cluster to search on secondaryKey. Is that correct?
  2. Is there any other performance impact from using a secondary index? It would be great if the reason were explained as well.

Upvotes: 1

Views: 1452

Answers (2)

Pedro Gordo

Reputation: 1865

Cassandra contacts nodes until it has collected enough rows matching your query to satisfy the LIMIT, or until it has contacted all nodes. It does this in rounds: one node in the first round, two nodes in the second round, four nodes in the third round, and so on, starting with the node that owns the first token.

You can check the complete algorithm in this article (section E): https://www.datastax.com/dev/blog/cassandra-native-secondary-index-deep-dive
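The doubling rounds described above can be sketched as a small simulation. This is only an illustration of the description in this answer, not Cassandra's actual coordinator code; the function name `scan_rounds` and the per-node match counts are made up for the example.

```python
def scan_rounds(matches_per_node, limit):
    """Simulate contacting 1, 2, 4, ... nodes per round.

    matches_per_node: number of matching rows held by each node, listed in
    token order (hypothetical data, not read from a real cluster).
    Returns (nodes_contacted, rows_returned).
    """
    contacted = 0   # how many nodes have been queried so far
    collected = 0   # how many matching rows have been gathered so far
    batch = 1       # how many nodes to query in the current round

    # Stop when the LIMIT is satisfied or every node has been asked.
    while contacted < len(matches_per_node) and collected < limit:
        for count in matches_per_node[contacted:contacted + batch]:
            collected += count
        contacted = min(contacted + batch, len(matches_per_node))
        batch *= 2  # the next round queries twice as many nodes

    return contacted, min(collected, limit)


# Five nodes, the only matching row lives on the third node, LIMIT 1.
print(scan_rounds([0, 0, 1, 0, 0], limit=1))
```

With a match on the first node the loop stops after a single node; with no matches anywhere it ends up asking all five nodes, which is the worst case the question is asking about.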

One thing to look out for when using secondary indexes is whether the indexed column has high cardinality, because that creates massive indexes that use a lot of disk space. Avoid secondary indexes on such columns.

Upvotes: 4

Evaldas Buinauskas

Reputation: 14077

To summarize the discussion from the comments:

Both of your queries will be executed on two nodes, because you're supplying the partition key. With that, the Cassandra query engine knows exactly which nodes hold that data.

If you were to run the following query:

SELECT *
FROM user
WHERE secondaryKey = 'test';

This would run on every node that holds data for your table, and each node would have to search its own data by that secondary key.

As I said, secondary indexes are local to each node. Suppose you have a users table whose data looks like this:

user_id  user_name
---------------------------
1        a_very_cool_user
2        a_very_cooler_user
3        the_coolest_user

If this data were split into three partitions, assume that each of three nodes holds exactly one row:

  • node 1 would have a_very_cool_user
  • node 2 would have a_very_cooler_user
  • node 3 would have the_coolest_user

If you were to index the user_name column, node 1 would index only a_very_cool_user and would not know what's on the other two nodes. The same applies to the other nodes. That's how local secondary indexes work in Cassandra.
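A toy model of the three-node example above, to show why a lookup on user_name alone has to ask every node. This is just a sketch using plain Python dictionaries; the `Node` class and `find_by_name` helper are hypothetical and have nothing to do with Cassandra's real storage engine.

```python
from collections import defaultdict


class Node:
    """One node holding its own rows and an index over those rows only."""

    def __init__(self, name):
        self.name = name
        self.rows = {}                 # user_id -> user_name
        self.index = defaultdict(set)  # user_name -> user_ids on THIS node

    def insert(self, user_id, user_name):
        self.rows[user_id] = user_name
        self.index[user_name].add(user_id)

    def lookup(self, user_name):
        # Only sees rows stored locally; other nodes' data is invisible here.
        return sorted(self.index.get(user_name, ()))


def find_by_name(nodes, user_name):
    """Without a partition key, every node's local index must be consulted."""
    asked = 0
    hits = []
    for node in nodes:
        asked += 1
        hits.extend((node.name, user_id) for user_id in node.lookup(user_name))
    return asked, hits


nodes = [Node("node 1"), Node("node 2"), Node("node 3")]
nodes[0].insert(1, "a_very_cool_user")
nodes[1].insert(2, "a_very_cooler_user")
nodes[2].insert(3, "the_coolest_user")

print(find_by_name(nodes, "the_coolest_user"))
```

Even though only node 3 holds the row, all three nodes are asked, because no single node's index can say where else the value might live.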

Upvotes: 2
