Reputation: 872
I have a table (let's call it T) set up with a PRIMARY KEY like the following:
PRIMARY KEY ((A, B), C, ....);
I want to query it like the following:
SELECT * FROM T WHERE A = ? and C <= ? PER PARTITION LIMIT 1 ALLOW FILTEIRNG;
(Note that C is a timstamp value. I am essentially asking for the most recent rows across all partitions whose first partition key belongs to my input).
This works with the allow filtering command, and it makes sense why I need it; I do not know beforehand the partition keys B, and I do not care - I want all of them. Therefore, it makes sense that Cassandra would need to scan the entire partition to give me the results, and it also makes sense why I would need to specify it to allow filtering for this to occur.
However, I have read that we should avoid 'ALLOW FILTERING' at all costs, as it can have a huge performance impact, especially in production environments. Indeed, I only use allow filtering very sparingly in my existing applications, and this is usually for one-off queries that calculate something of this nature.
My quesiton is this: is there a way to restructure this table or query to avoid filtering? I am thinking it is impossible, as I do not have knowledge of the keys that make up B beforehand, but I want to double check just to be sure. Thanks!
Upvotes: 0
Views: 635
Reputation: 16400
You cannot efficiently make that query if (A, B)
is your partition key. your key would need to be ((A), B)
(dropping clustering keys). Then SELECT * FROM T WHERE A = ?
. If only care about the latest, then A, B would always be replaced with the most recent.
Another option if looking to get the A,B tuples from a time is to create a table thats indexed by time and have the tuples be clustering columns from there like ((time_bucket), A, B, C)
. time_bucket
being a string like 2018-04-06:00:00:00
that contains all the events for that day. Then when you query like:
> CREATE TABLE example (time_bucket text, A int, B int, C int, D int, PRIMARY KEY ((time_bucket), A, B, C)) WITH CLUSTERING ORDER BY (A ASC, B ASC, C DESC);
> INSERT INTO example (time_bucket, A, B, C, D) VALUES ('2018-04', 1, 1, 100, 999);
> INSERT INTO example (time_bucket, A, B, C, D) VALUES ('2018-04', 1, 1, 120, 999);
> INSERT INTO example (time_bucket, A, B, C, D) VALUES ('2018-04', 1, 1, 130, 999);
> INSERT INTO example (time_bucket, A, B, C, D) VALUES ('2018-04', 1, 2, 130, 999);
> SELECT * FROM example WHERE time_bucket = '2018-04' GROUP BY time_bucket, A, B;
time_bucket | a | b | c | d
-------------+---+---+-----+-----
2018-04 | 1 | 1 | 130 | 999
2018-04 | 1 | 2 | 130 | 999
You would get the 1st result from each of the rows in the time bucket partition whose clustering by A and B. If you make the partitions small enough (use finer grain time buckets, like hours or 15 minutes or something, depending on data rate) its more acceptable to use ALLOW FILTERING here then like:
SELECT * FROM example WHERE time_bucket = '2018-04' AND A = 1 AND C < 120 GROUP BY time_bucket, A, B ALLOW FILTERING ;
time_bucket | a | b | c | d
-------------+---+---+-----+-----
2018-04 | 1 | 1 | 100 | 999
Because its all within one partition and within a bounded size (monitor it closely with tablestats/max partition size). Make sure always querying with time_bucket though so it doesnt become a range query. You want to make sure you do not end up going through too many things without returning a result (which is one of dangers of allow filtering).
Upvotes: 2