qiGuar

Reputation: 1804

Cassandra index explained

OK, I've been searching for an explanation for a while now, but still can't find the answer.

When it comes to Cassandra indexes, I get the main points, one of which is:

For low cardinality I get it: when searching, we'll get a very wide row.

But what happens behind the scenes with high-cardinality data? All books and blogs seem to copy the DataStax example, which doesn't explain WHY but simply tells you not to do this.

Suppose I want to create an index on user email. If I understand correctly, when I search for a user by email, two things will happen:

  1. Ask all nodes which one has the user id related to this email
  2. Get the user from the right partition by user id

If I create an index on user country (which seems to be a more appropriate field), the algorithm should be the same.

So please explain what I'm missing about why it's bad to use an index on high-cardinality data.

Also, on a related topic: is there a case where an index is preferable to a materialized view?

Upvotes: 2

Views: 440

Answers (2)

HerberthObregon

Reputation: 2141

In summary: use an index when you know the partition key, or when the query has to hit all nodes anyway (a full-text search, or a count of something, for example how many times all the articles published in a blog have been viewed), and you need to match a specific value, like:

 WHERE age = 18
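To make the "partition key is known" case concrete, here is a minimal sketch. The table `users_by_country`, its columns, and the index name are hypothetical, chosen only for illustration; they do not come from the linked articles.

```cql
-- Hypothetical table: country is the partition key, userid a clustering column.
CREATE TABLE users_by_country (
    country text,
    userid bigint,
    age int,
    email text,
    PRIMARY KEY (country, userid)
);

-- Secondary index on a regular (non-key) column.
CREATE INDEX users_by_country_age_idx ON users_by_country (age);

-- Partition key restricted: only the replicas that own the 'NL' partition
-- are asked, and each uses its local index to find the rows with age = 18.
SELECT * FROM users_by_country WHERE country = 'NL' AND age = 18;

-- No partition key: the same index is used, but the coordinator may have
-- to contact every node, because any of them could hold matching rows.
SELECT * FROM users_by_country WHERE age = 18;
```

The first query is the cheap one; the second is the kind that fans out across the cluster.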

Use materialized views when you DO NOT KNOW THE PARTITION KEY and you need a range, like:

WHERE age > 18 and age < 30

References:

Main article:

Cassandra Secondary Index Preview #1

Here is a comparison of materialized views and secondary indexes:

Materialized View Performance in Cassandra 3.x

And this one explains why an index is more effective when the partition key is known:

Cassandra Native Secondary Index Deep Dive

Upvotes: 1

Ashraful Islam

Reputation: 12840

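Suppose you create an index on a high-cardinality column like email.
If you query for a userid by email, Cassandra needs to execute that query on all hosts to get that single userid. A secondary index is local to each node (it only covers the rows that node stores), so the coordinator cannot tell in advance which node holds the match. You are querying all hosts to get a single userid; isn't that costly?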
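As a rough sketch of that indexed variant, assuming a hypothetical `users` table keyed by `userid` (the table and index names are mine, not from the question):

```cql
-- Hypothetical base table, partitioned by userid.
CREATE TABLE users (
    userid bigint PRIMARY KEY,
    email text,
    country text
);

-- Index on a high-cardinality column: almost every email is unique.
CREATE INDEX users_email_idx ON users (email);

-- At most one row matches, but no partition key is given,
-- so the query cannot be routed to a single node up front.
SELECT userid FROM users WHERE email = 'alice@example.com';
```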

Instead, suppose you create a lookup table like the one below:

CREATE TABLE userid_by_email (
    email text PRIMARY KEY,
    userid bigint
);

Cassandra will return the userid by querying a single host, because email is the partition key of this table.
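The price is that you have to maintain the lookup table yourself. A minimal sketch of how the writes and the two-step read could look, assuming a hypothetical `users` base table (its name, columns, and the sample values are illustrative only):

```cql
-- Hypothetical base table, partitioned by userid.
CREATE TABLE IF NOT EXISTS users (
    userid bigint PRIMARY KEY,
    email text,
    country text
);

-- Write both tables together. A logged batch ensures that either both
-- inserts are eventually applied or neither is.
BEGIN BATCH
    INSERT INTO users (userid, email, country)
        VALUES (42, 'alice@example.com', 'NL');
    INSERT INTO userid_by_email (email, userid)
        VALUES ('alice@example.com', 42);
APPLY BATCH;

-- Step 1: single-partition read on the lookup table.
SELECT userid FROM userid_by_email WHERE email = 'alice@example.com';

-- Step 2: single-partition read on the base table, using the userid
-- returned by step 1.
SELECT * FROM users WHERE userid = 42;
```

Both reads are routed by partition key, so neither has to fan out across the cluster.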

Your other question is answered here: https://stackoverflow.com/a/36476772/2320144

Upvotes: 0
