Reputation: 107
Datastax documentation talks about creating more than one secondary index here. But when I have to query in a where clause using both indexes the documentation suggests using allow filtering. 1) I am worried about using allow filtering on production and 2) If I am to use allow filtering, does that not defeat the whole purpose of those indexes in a scenario where I always have to use both those indexes together.
A possible solution seems to be custom indexes on both columns, but the apache documentation here is a bit vague and also does not speak about performance of these.
So what is the suggested approach when I need to query with multiple secondary indexes? Any opinions on solving this will be helpful.
EDIT1: A view of my cassandra table is available on this link represented as a Java Class. I have to query using where col1='val1' and col2='val2' and col3='val3'
EDIT2: I did think about creating a new column with data of col1,2,3 something like newcol='val1val2val3' so I can create a single secondary index on newcol and do away with this conundrum, but it seems to be a bit of a hack rather than strategic. Any comments on this plan will be appreciated. PS: This newcol will have a medium cardinality .
EDIT3: I did find good info on secondary indexes and allow filtering here which does seem to help
Upvotes: 3
Views: 5041
Reputation: 13731
As worst case for your use case, consider searching for an Austrian composer born in 1756. Yes, you can find him (Mozart) in a table of all humans who ever lived by intersecting the index of nationality=Austria, the index of birth=1756 and the index of profession=composer. But Cassandra will implement such a query very inefficiently - it either needs to retrieve huge lists and intersect them, or, what it really does, retrieve just one huge list (e.g., list of all Austrians who ever lived) and then filter them according to the other criteria (birth and profession). This is why you need the "ALLOW FILTERING". And why it is not a recommended use case for Cassandra's original Secondary Index.
Unlike Cassandra's original Secondary Index, search engines are geared toward exactly these sorts of intersections, and have special algorithms for calculating them efficiently. In particular, search engines usually have "skip lists", allowing find a small intersection of two lengthy lists by quickly skipping in one of the lists based on entries in the second list. They also have logic on which list (the shorter list, i.e., the rarer word) to start the process with.
As you may know, Cassandra has a second secondary-index implementation, known as SASI. SASI (see https://github.com/apache/cassandra/blob/trunk/doc/SASI.md) has many search-engine-oriented improvements over Cassandra's original secondary index implementation, and if I understand correctly (I never tried myself), efficient intersections is one of these features. So maybe switching to SASI is a good idea in your use case.
Upvotes: 1
Reputation: 16400
1) You should be. I highly recommend avoiding secondary indexes and ALLOW FILTERING consider them advanced features for corner cases.
2) It can be more efficient with index, but still horrible, and also horrible in more new ways. There are only very few scenarios where secondary indexes are acceptable. There are very few scenarios where ALLOW FILTERING is acceptable. You are looking at an overlap of the two.
Maybe take a step back. You're building pojos to represent objects and trying to map that into Cassandra. The approach you should make when data modeling with Cassandra is to think of the queries you are going to make and design tables to match that - not the data. It is normal to end up with multiple tables that you update (disk space and writes are cheap) on changes so that your reads can hit 1 partition efficiently and get everything you need in one hit. Denormalize the data, Cassandra is not relational and 3rd normal form is generally a bad thing here.
Upvotes: 3