Cassandra performance on clustered column sorting vs secondary index

Question

my schema is :

A)

CREATE TABLE friend_list (
    userId uuid,
    friendId uuid,
    accepted boolean, 
    ts_accepted timestamp,
    PRIMARY KEY ((userId) ,accepted, ts_accepted)
   ) with clustering order by (accepted desc, ts_accepted desc);

B)

CREATE TABLE friend_list (
        userId uuid,
        friendId uuid,
        accepted boolean, 
        ts_accepted timestamp,
        PRIMARY KEY (userId , ts_accepted)
       ) with clustering order by (ts_accepted desc);
CREATE INDEX ON friend_list (accepted);

Which will give the best performance for the query :

SELECT * FROM friend_list WHERE userId="---" AND accepted=true;

With my understanding, Cassandra automatically sorts the clustered columns in ASC order and we specify DESC if we need to change the default sorting order for efficient queries.

With my schema A, I am making 'accepted' as a clustered key, but I need to sort it unnecessarily as I definitely have to sort 'ts_accepted' to DESC. Will this unwanted sorting of 'accepted' affects performance ?

If so, say I am making 'accepted' as secondary index in schema B. I know secondary index are not bad for low cardinal values(boolean). But still the query might have some performance issue.

Please let me know the efficient way on achieving this query.

ashic · Accepted Answer

I'd go for A.

If you can avoid a secondary index, avoid it (exception: you know it'll be a spark job that would benefit from it). If you still need a secondary index, redesign your model. If you still need it, feel horrible inside, and then maybe consider it.

The cost of clustering order by that you fear isn't appropriate. Cassandra stores clustering columns sorted anyway...ASC or DESC doesn't change things. You're using a tad more space, but for your query, you want to hit "accepted", so it's justified. I'm guessing ts_accepted is needed for some other reason? The only catch here is that if you need or have access to ts_accepted in your query, you need to provide an accepted equality filter. Performance wise, I don't see an issue.

As for B, indexes on extremely low cardinality columns (like bools) is bad. Consider how the data is stored - for each node, Cassandra maintains a table where the key is the value (true / false) and the values are the keys of all data for that node that matches the key. That has the potential for be a very wide column. Would you do that if you were modelling a separate table? No. And you shouldn't do that with an index either.

I don't know about the rest of the data, but if you're looking to get friends that have been accepted, why bother with a boolean? You may be able to use the ts_accepted column to infer the boolean. If they have a value, it's accepted, right?

One thing you should be aware of is that you can't update a column that's part of the pk.

Lastly, you ARE hitting the partition key (UserId) for your query. This is very good for your query. It means it'll hit exactly one partition. Depending on your use case (and size of entries), it might even be feasible to load the whole partition and filter client / app side. Of course, this depends on expected friend list sizes, and the data size vs network traffic vs app processing you need / are willing to do. For example, loading 100 entries and filtering for accepted app side, and loading 50 entries by filtering db side might have similar performance numbers.

Cassandra performance on clustered column sorting vs secondary index

Answers (2)

Related Questions