Cassandra: ALLOW FILTERING works twice as fast as index on a mapped column

Question

I read how ALLOW FILTERING is terrible and shall never ever be used in production. To my surprise, I discovered that in my case a query runs FASTER when I remove a secondary index and add ALLOW FILTERING. How is that possible?

More details: this query involves a secondary index on a map column.

Even more details

My table looks like this (some column names were changed to protect the IP, but that should not matter):

CREATE TABLE IF NOT EXISTS test.data (
  channel text,
  subdomain text,
  deployment_group text,
  data_center text,
  day date, // we need this to limit the partition size
  message_id timeuuid,
  tags map,
  sender_id text,
  sequence_number bigint,
  sent_at timestamp,
  received_at timestamp,
  header blob,
  payload blob,
  PRIMARY KEY ((channel, subdomain, deployment_group, data_center, day), sent_at, message_id)
)
WITH CLUSTERING ORDER BY (sent_at ASC, message_id ASC);

The index I was using is CREATE INDEX idx_tags ON test.data (ENTRIES (tags));

The table has about 10M records, the time range query has about 50K rows, and additional condition on the tag column selects about 8K rows.

This query runs for 2.5s:

SELECT message_id, sender_id, sequence_number, sent_at, tags FROM test.data 
WHERE channel='MyChannel' AND subdomain='test' AND deployment_group='_' 
AND data_center='xyz' and day='2023-07-12' 
and sent_at >= '2023-07-12 20:00:00+0000' and sent_at <'2023-07-12 21:00:00+0000' 
and tags['symbol']='ABC';

If I drop the index and re-run the query with ALLOW FILTERING, it takes 2.5s:

This query runs for 4.4s:

SELECT message_id, sender_id, sequence_number, sent_at, tags FROM test.data 
WHERE channel='MyChannel' AND subdomain='test' AND deployment_group='_' 
AND data_center='xyz' and day='2023-07-12' 
and sent_at >= '2023-07-12 20:00:00+0000' and sent_at <'2023-07-12 21:00:00+0000' 
and tags['symbol']='ABC' 
ALLOW FILTERING;

How is it possible? Are indexes on mapped columns THAT inefficient? I am flabbergasted.

clunven · Accepted Answer

Looking at your query I see that in the WHERE clause you put every single variables of your partition key. (WHERE channel='MyChannel' AND subdomain='test' AND deployment_group='' AND data_center='xyz' and day='2023-07-12')_

As a consequence, the query will execute on a single partition. Under those circumstances the ALLOW FILTERING is not actually an harmful thing as long as the partition is not huge. (not the size of the table, nor the inequalities positionned after in the query.

The secondary index will look for the index on every node, it tries to find the list of partitions to scan first and then do then job, this is where you are loosing time.

Cassandra: ALLOW FILTERING works twice as fast as index on a mapped column

Answers (1)

Related Questions