Reputation: 689
I have a table playground with an indexed column val.
I have a list of ranges [(min1, max1), (min2, max2), ..., (minN, maxN)] and I want to select all rows whose val falls in any of those ranges.
E.g. my ranges look like this: [(1,5), (20,25), (200,400)]
Here is the simple query that extracts corresponding rows:
select p.*
from playground p
where (val between 1 AND 5) or (val between 20 and 25) or
(val between 200 and 400);
The problem here is that this list of ranges is dynamic: my application generates it and sends it along with the query to Postgres.
I tried to rewrite the query to accept a dynamic list of ranges:
select p.*
from playground p,
unnest(ARRAY [(1, 5),(20, 25),(200, 400)]) as r(min_val INT, max_val INT)
where p.val between r.min_val and r.max_val;
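For a truly dynamic list, the bounds can also be bound as parameters instead of spliced into the SQL text. A sketch using the multi-argument form of unnest(), where the two literal arrays stand in for two bind parameters (one array of lower bounds, one of upper bounds, in matching order):

```sql
SELECT p.*
FROM   unnest('{1,20,200}'::int[], '{5,25,400}'::int[]) AS r(min_val, max_val)
JOIN   playground p ON p.val BETWEEN r.min_val AND r.max_val;
```

This produces the same rows and plan shape as the record-array variant, but the arrays can be passed straight from the client driver.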
It extracts the same rows, but I don't know whether it is an efficient query or not.
This is what the EXPLAIN output looks like for the first query:
Bitmap Heap Scan on playground p  (cost=12.43..16.45 rows=1 width=36) (actual time=0.017..0.018 rows=4 loops=1)
  Recheck Cond: (((val >= 1) AND (val <= 5)) OR ((val >= 20) AND (val <= 25)) OR ((val >= 200) AND (val <= 400)))
  Heap Blocks: exact=1
  ->  BitmapOr  (cost=12.43..12.43 rows=1 width=0) (actual time=0.012..0.012 rows=0 loops=1)
        ->  Bitmap Index Scan on playground_val_index  (cost=0.00..4.14 rows=1 width=0) (actual time=0.010..0.010 rows=3 loops=1)
              Index Cond: ((val >= 1) AND (val <= 5))
        ->  Bitmap Index Scan on playground_val_index  (cost=0.00..4.14 rows=1 width=0) (actual time=0.001..0.001 rows=0 loops=1)
              Index Cond: ((val >= 20) AND (val <= 25))
        ->  Bitmap Index Scan on playground_val_index  (cost=0.00..4.14 rows=1 width=0) (actual time=0.001..0.001 rows=1 loops=1)
              Index Cond: ((val >= 200) AND (val <= 400))
Planning Time: 0.071 ms
Execution Time: 0.057 ms
And here is the EXPLAIN output for the second:
Nested Loop  (cost=0.14..12.52 rows=2 width=36) (actual time=0.033..0.065 rows=4 loops=1)
  ->  Function Scan on unnest r  (cost=0.00..0.03 rows=3 width=8) (actual time=0.011..0.012 rows=3 loops=1)
  ->  Index Scan using playground_val_index on playground p  (cost=0.13..4.15 rows=1 width=36) (actual time=0.008..0.015 rows=1 loops=3)
        Index Cond: ((val >= r.min_val) AND (val <= r.max_val))
Planning Time: 0.148 ms
Execution Time: 0.714 ms
NOTE: In both cases I did set enable_seqscan = false; to make the index work.
I am worried about the "Nested Loop" stage. Is it OK? Or are there more efficient ways to pass a dynamic list of ranges into a query?
My Postgres version is 12.1.
Upvotes: 1
Views: 544
Reputation: 656942
You added more information, but much more would still be relevant: exact table and index definitions, cardinality, data distribution, row size stats, the number of ranges in the predicate, the purpose of the table, write patterns, ... Performance optimization needs all the input it can get.
Shot in the dark: with non-overlapping ranges, a UNION ALL query may deliver the best performance:
SELECT * FROM playground WHERE val BETWEEN 1 AND 5
UNION ALL
SELECT * FROM playground WHERE val BETWEEN 20 AND 25
UNION ALL
SELECT * FROM playground WHERE val BETWEEN 200 AND 400;
We know that the ranges don't overlap, but Postgres doesn't, so it has to do extra work in your attempts. This query should avoid both the BitmapOr of the first plan and the Nested Loop of the second. Just fetch each range and append it to the output. It should result in a plan like:
Append  (cost=0.13..24.50 rows=3 width=40)
  ->  Index Scan using playground_val_idx on playground  (cost=0.13..8.15 rows=1 width=40)
        Index Cond: ((val >= 1) AND (val <= 5))
  ->  Index Scan using playground_val_idx on playground playground_1  (cost=0.13..8.15 rows=1 width=40)
        Index Cond: ((val >= 20) AND (val <= 25))
  ->  Index Scan using playground_val_idx on playground playground_2  (cost=0.13..8.15 rows=1 width=40)
        Index Cond: ((val >= 200) AND (val <= 400))
Plus, each sub-SELECT will be based on actual statistics for the given range, not on generic estimates, even for a longer list of ranges.
You can generate the query in your client, or write a server-side function that generates and executes dynamic SQL (applicable since the result type is known).
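A sketch of such a dynamic-SQL function; the name bar and its exact shape are my own, assuming the same 2-D array format as the LOOP variant below. The bounds are plain integers, so interpolating them with format()'s %s is safe here; use %L for anything that needs quoting:

```sql
CREATE OR REPLACE FUNCTION bar(_ranges int[])
  RETURNS SETOF playground
  LANGUAGE plpgsql STABLE AS
$func$
DECLARE
   _range int[];
   _sql   text;
BEGIN
   -- build one SELECT per range, chained with UNION ALL
   -- (concat_ws skips the NULL _sql on the first iteration)
   FOREACH _range SLICE 1 IN ARRAY _ranges
   LOOP
      _sql := concat_ws(' UNION ALL ', _sql
            , format('SELECT * FROM playground WHERE val BETWEEN %s AND %s'
                   , _range[1], _range[2]));
   END LOOP;
   RETURN QUERY EXECUTE _sql;
END
$func$;
```

Called the same way: SELECT * FROM bar('{{1,5},{20,25},{200,400}}'); This gets the Append plan shown above while keeping a single round trip from the client.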
You might even test a server-side function using a LOOP (which is often less efficient, but this may be an exception):
CREATE OR REPLACE FUNCTION foo(_ranges int[])
  RETURNS SETOF playground
  LANGUAGE plpgsql PARALLEL SAFE STABLE AS
$func$
DECLARE
   _range int[];
BEGIN
   -- iterate over the first array dimension: one slice per (min, max) pair
   FOREACH _range SLICE 1 IN ARRAY _ranges
   LOOP
      RETURN QUERY
      SELECT * FROM playground WHERE val BETWEEN _range[1] AND _range[2];
   END LOOP;
END
$func$;
The overhead may not pay off for only a few ranges per call. But it is very convenient to call, if nothing else:
SELECT * FROM foo('{{1,5},{20,25},{200,400}}');
Physical order of rows may help a lot. If matching rows are stored close together, (much) fewer data pages need to be processed. Whether that applies depends on undisclosed details. The built-in CLUSTER command, or the extensions pg_repack or pg_squeeze, may help with that.
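For illustration, a one-time physical reordering by the index from the question might look like this. Note that CLUSTER takes an ACCESS EXCLUSIVE lock and rewrites the table, and the order is not maintained for subsequent writes:

```sql
-- rewrite the table in index order (blocks all access while running)
CLUSTER playground USING playground_val_index;

-- refresh planner statistics, including the physical correlation of val
ANALYZE playground;
```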
And it's recommended to run the latest available minor release for whatever major version is in use. At the time of writing that would be 12.2 (released 2020-02-13).
Upvotes: 1