Karol
Karol

Reputation: 23

How does ALLOW FILTERING work when we provide all of the partition keys?

I've read at least 50 articles on this and still don't know the answer ... I know how partitioning, clustering and ALLOW FILTERING work, but can't figure out what is the situation behind using ALLOW FILTERING with all partition keys provided in a query.

I have a table like this:

CREATE TABLE IF NOT EXISTS keyspace.events (
  date_string varchar,
  starting_timestamp bigint,
  event_name varchar,
  sport_id varchar
  PRIMARY KEY ((date_string), starting_timestamp, id)
);

How does query like this work ?

SELECT * FROM keyspace.events
WHERE
   date_string IN ('', '', '') AND
   starting_timestamp < '' AND
   sport_id = 1 /* not in partitioning nor clustering key */
ALLOW FILTERING;   

Is the 'sport_id' filtering done on records retreived earlier by the correctly defined keys ? Is ALLOW FILTERING still discouraged in this kind of query ?

How should I perform filtering in this particular situation ?

Thanks in advance

Upvotes: 1

Views: 1832

Answers (1)

M P
M P

Reputation: 300

  1. Yes, it should first filter out the partitions and then only will do the filtering on the non-key value and as per the experiment mentioned here : https://dzone.com/articles/apache-cassandra-and-allow-filtering I think its safe to use the allow filtering after all the keys in most case.

  2. It will highly depend on how much data you are filtering out as well - if the last condition of sport_id = 1 is trying to filter out most of the data then it will be a bad idea as it gives a lot of pressure to the database, so you need to consider the trade-offs here.

  3. Its not a good idea to use an IN clause with the partition key - especially the above query doesnt look good because its using both IN clause on Partition key and the allow filtering.

  4. Suggestion - Cassandra is very good at processing as many requests as you need in a second and the design idea should be to send more lighter queries at once than trying to send one query which does lot of work. So my suggestion would be to fire N calls to Cassandra each with = condition on partition key without filtering the last column and then combine and do final filter in the code (which ever language you are using I assume it can support sending all these calls parallel to the database). By doing so you will get the advantage in performance in long term when the data grows.

Upvotes: 3

Related Questions