voipp

Reputation: 1461

Why doesn't Cassandra have secondary indexes?

Cassandra is positioned as a scalable and fast database. From a technical standpoint, why can't those goals be accomplished with secondary indexes?

Upvotes: 1

Views: 278

Answers (2)

Highstead

Reputation: 2441

So yes, Cassandra does have secondary indexes, and Aaron's answer does a great job of explaining why they tend to perform poorly.

Many people work around this by writing their data to multiple tables. That way, the data needed to answer a query that would traditionally rely on a secondary index is guaranteed to live on the same node.

Some recent versions of Cassandra have this "built in" via materialized views. I haven't really used them since 3.0.11, but they are promising. The problems I had at the time were mainly with adding them to tables that already held data, and with a surprisingly large write overhead (increased latency).

Upvotes: 0

Aaron

Reputation: 57808

Cassandra does indeed have secondary indexes. But secondary indexes don't work well in distributed databases, because each node holds only a subset of the overall dataset.

I previously wrote an answer which discussed the underlying details of secondary index queries:

How do secondary indexes work in Cassandra?

While it should help give you some understanding of what's going on, that answer is written from the context of first querying by a partition key. This is an important distinction, as secondary index usage within a partition should perform well.

The problem arises when querying only by a secondary index: Cassandra cannot guarantee that all of your data can be served by a single node. When this happens, Cassandra designates a node as a coordinator, which in turn queries all other nodes for the specified indexed values.

Essentially, instead of performing sequential reads from a single node, secondary index usage forces Cassandra to perform random reads from all nodes. Now you don't have just disk seek time, but also network time complicating things.
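To make the fan-out concrete, here is a rough toy model in Python. It is not how Cassandra is implemented: the `owner` function is a stand-in for the partitioner, the four-node cluster and the `user_id`/`city` columns are made up for illustration, replication is ignored, and the per-node scan stands in for each node consulting its own local index. It only counts how many nodes a coordinator would have to contact for each kind of query.

```python
import zlib

NUM_NODES = 4  # hypothetical cluster size, no replication


def owner(partition_key: str) -> int:
    """Toy stand-in for the partitioner: map a partition key to one node."""
    return zlib.crc32(partition_key.encode("utf-8")) % NUM_NODES


# Each node stores only the rows whose partition key it owns.
nodes = [{} for _ in range(NUM_NODES)]


def insert(user_id: str, city: str) -> None:
    nodes[owner(user_id)][user_id] = {"user_id": user_id, "city": city}


def query_by_partition_key(user_id: str):
    """The key tells us which node owns the row, so one node is contacted."""
    row = nodes[owner(user_id)].get(user_id)
    return ([row] if row else []), 1


def query_by_indexed_column(city: str):
    """The city says nothing about placement, so every node must be asked."""
    rows, contacted = [], 0
    for node in nodes:
        contacted += 1
        # Stand-in for a lookup in that node's local secondary index.
        rows.extend(r for r in node.values() if r["city"] == city)
    return rows, contacted


for uid, city in [("u1", "Oslo"), ("u2", "Lima"), ("u3", "Oslo"), ("u4", "Pune")]:
    insert(uid, city)

print("by partition key, nodes contacted:", query_by_partition_key("u3")[1])
print("by indexed column, nodes contacted:", query_by_indexed_column("Oslo")[1])
```

In this sketch the partition-key lookup always touches one node, while the index-only lookup touches all of them regardless of how many rows match.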

The recommendation for Cassandra data modeling is to duplicate your data into new tables, each built to support a desired query. This adds the complication of keeping the copies in sync. But (when done correctly) it ensures that your queries can indeed be served by a single node. That's a tradeoff you need to make when building your model. You can have convenience or performance, but not both.
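A minimal sketch of that duplication approach, again as a toy Python model rather than real Cassandra code. The table names `users_by_id` and `users_by_city`, the `owner` hash, and the four-node cluster are all assumptions made up for the example. Every write goes to both tables, and each table is partitioned by the column it is queried on.

```python
import zlib

NUM_NODES = 4  # hypothetical cluster size, no replication


def owner(partition_key: str) -> int:
    """Toy stand-in for the partitioner: map a partition key to one node."""
    return zlib.crc32(partition_key.encode("utf-8")) % NUM_NODES


# nodes[i][table_name][partition_key] -> list of rows in that partition
nodes = [{} for _ in range(NUM_NODES)]


def write(table: str, partition_key: str, row: dict) -> None:
    node = nodes[owner(partition_key)]
    node.setdefault(table, {}).setdefault(partition_key, []).append(dict(row))


def add_user(user_id: str, city: str) -> None:
    """One logical write becomes two physical writes, one per query table.

    Nothing here makes the pair atomic; keeping the copies in sync is the
    application's job in this sketch.
    """
    row = {"user_id": user_id, "city": city}
    write("users_by_id", user_id, row)   # partitioned by user_id
    write("users_by_city", city, row)    # partitioned by city


def read(table: str, partition_key: str):
    """Read one partition; return its rows and the node ids contacted."""
    node_id = owner(partition_key)
    rows = nodes[node_id].get(table, {}).get(partition_key, [])
    return list(rows), [node_id]


def stored_rows() -> int:
    """Total rows held across the cluster, to show the storage cost."""
    return sum(len(rows) for node in nodes
               for table in node.values() for rows in table.values())


for uid, city in [("u1", "Oslo"), ("u2", "Lima"), ("u3", "Oslo"), ("u4", "Pune")]:
    add_user(uid, city)

rows, contacted = read("users_by_city", "Oslo")
print("matching users:", sorted(r["user_id"] for r in rows))
print("nodes contacted:", len(contacted))
print("rows stored for 4 users:", stored_rows())
```

The query by city now reads a single partition on a single node, at the cost of storing every user twice and doing two writes per insert.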

Upvotes: 2
