Reputation: 689
I have a table playground with an indexed column val.
I have a list of ranges [(min1, max1), (min2, max2), ..., (minN, maxN)] and I want to select all rows whose val falls in any of those ranges.
E.g. my ranges look like this: [(1,5), (20,25), (200,400)]
Here is the simple query that extracts corresponding rows:
select p.*
from playground p
where (val between 1 AND 5) or (val between 20 and 25) or
(val between 200 and 400);
The problem here is that this list of ranges is dynamic: my application generates it and sends it along with the query to Postgres.
I tried to rewrite the query to accept a dynamic list of ranges:
select p.*
from playground p,
unnest(ARRAY [(1, 5),(20, 25),(200, 400)]) as r(min_val INT, max_val INT)
where p.val between r.min_val and r.max_val;
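For a truly dynamic list, the bounds can also be bound as parameters instead of spliced into the SQL text. A sketch using the multi-argument form of unnest(), where the two literal arrays stand in for two bind parameters (one array of lower bounds, one of upper bounds, in matching order):

```sql
SELECT p.*
FROM   unnest('{1,20,200}'::int[], '{5,25,400}'::int[]) AS r(min_val, max_val)
JOIN   playground p ON p.val BETWEEN r.min_val AND r.max_val;
```

This produces the same rows and plan shape as the record-array variant, but the arrays can be passed straight from the client driver.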
It extracts the same rows, but I don't know whether it is an efficient query or not.
This is what the EXPLAIN output looks like for the first query:
Bitmap Heap Scan on playground p  (cost=12.43..16.45 rows=1 width=36) (actual time=0.017..0.018 rows=4 loops=1)
  Recheck Cond: (((val >= 1) AND (val <= 5)) OR ((val >= 20) AND (val <= 25)) OR ((val >= 200) AND (val <= 400)))
  Heap Blocks: exact=1
  ->  BitmapOr  (cost=12.43..12.43 rows=1 width=0) (actual time=0.012..0.012 rows=0 loops=1)
        ->  Bitmap Index Scan on playground_val_index  (cost=0.00..4.14 rows=1 width=0) (actual time=0.010..0.010 rows=3 loops=1)
              Index Cond: ((val >= 1) AND (val <= 5))
        ->  Bitmap Index Scan on playground_val_index  (cost=0.00..4.14 rows=1 width=0) (actual time=0.001..0.001 rows=0 loops=1)
              Index Cond: ((val >= 20) AND (val <= 25))
        ->  Bitmap Index Scan on playground_val_index  (cost=0.00..4.14 rows=1 width=0) (actual time=0.001..0.001 rows=1 loops=1)
              Index Cond: ((val >= 200) AND (val <= 400))
Planning Time: 0.071 ms
Execution Time: 0.057 ms
And here is the EXPLAIN output for the second:
Nested Loop  (cost=0.14..12.52 rows=2 width=36) (actual time=0.033..0.065 rows=4 loops=1)
  ->  Function Scan on unnest r  (cost=0.00..0.03 rows=3 width=8) (actual time=0.011..0.012 rows=3 loops=1)
  ->  Index Scan using playground_val_index on playground p  (cost=0.13..4.15 rows=1 width=36) (actual time=0.008..0.015 rows=1 loops=3)
        Index Cond: ((val >= r.min_val) AND (val <= r.max_val))
Planning Time: 0.148 ms
Execution Time: 0.714 ms
NOTE: In both cases I did set enable_seqscan = false; to make the index work.
I am worried about the "Nested Loop" stage. Is it OK? Or are there more efficient ways to pass a dynamic list of ranges into a query?
My Postgres version is 12.1.
Upvotes: 1
Views: 544
Reputation: 656942
You added more information, but much more would still be relevant: exact table and index definitions, cardinality, data distribution, row size stats, the number of ranges in the predicate, the purpose of the table, write patterns, ... Performance optimization needs all the input it can get.
Shot in the dark: with non-overlapping ranges, a UNION ALL query may deliver the best performance:
SELECT * FROM playground WHERE val BETWEEN 1 AND 5
UNION ALL
SELECT * FROM playground WHERE val BETWEEN 20 AND 25
UNION ALL
SELECT * FROM playground WHERE val BETWEEN 200 AND 400;
We know that the ranges don't overlap, but Postgres doesn't, so it has to do extra work in your attempts. This query should avoid both the BitmapOr of the first plan and the Nested Loop of the second. Just fetch each range and append it to the output. It should result in a plan like:
Append  (cost=0.13..24.50 rows=3 width=40)
  ->  Index Scan using playground_val_idx on playground  (cost=0.13..8.15 rows=1 width=40)
        Index Cond: ((val >= 1) AND (val <= 5))
  ->  Index Scan using playground_val_idx on playground playground_1  (cost=0.13..8.15 rows=1 width=40)
        Index Cond: ((val >= 20) AND (val <= 25))
  ->  Index Scan using playground_val_idx on playground playground_2  (cost=0.13..8.15 rows=1 width=40)
        Index Cond: ((val >= 200) AND (val <= 400))
Plus, each sub-SELECT will be based on actual statistics for the given range, not on generic estimates, even for a longer list of ranges.
You can generate the query in your client, or write a server-side function that generates and executes dynamic SQL (applicable since the result type is known).
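A sketch of such a dynamic-SQL function; the name bar and its exact shape are my own, assuming the same 2-D array format as the LOOP variant below. The bounds are plain integers, so interpolating them with format()'s %s is safe here; use %L for anything that needs quoting:

```sql
CREATE OR REPLACE FUNCTION bar(_ranges int[])
  RETURNS SETOF playground
  LANGUAGE plpgsql STABLE AS
$func$
DECLARE
   _range int[];
   _sql   text;
BEGIN
   -- build one SELECT per range, chained with UNION ALL
   -- (concat_ws skips the NULL _sql on the first iteration)
   FOREACH _range SLICE 1 IN ARRAY _ranges
   LOOP
      _sql := concat_ws(' UNION ALL ', _sql
            , format('SELECT * FROM playground WHERE val BETWEEN %s AND %s'
                   , _range[1], _range[2]));
   END LOOP;
   RETURN QUERY EXECUTE _sql;
END
$func$;
```

Called the same way: SELECT * FROM bar('{{1,5},{20,25},{200,400}}'); This gets the Append plan shown above while keeping a single round trip from the client.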
You might even test a server-side function using a LOOP (which is often less efficient, but this may be an exception):
CREATE OR REPLACE FUNCTION foo(_ranges int[])
  RETURNS SETOF playground
  LANGUAGE plpgsql PARALLEL SAFE STABLE AS
$func$
DECLARE
   _range int[];
BEGIN
   -- iterate over the first array dimension: one slice per (min, max) pair
   FOREACH _range SLICE 1 IN ARRAY _ranges
   LOOP
      RETURN QUERY
      SELECT * FROM playground WHERE val BETWEEN _range[1] AND _range[2];
   END LOOP;
END
$func$;
The overhead may not pay off for only a few ranges per call. But it is very convenient to call, if nothing else:
SELECT * FROM foo('{{1,5},{20,25},{200,400}}');
Physical order of rows may help a lot. If matching rows are stored close together, (much) fewer data pages need to be processed. Whether that applies depends on undisclosed details. The built-in CLUSTER command, or the extensions pg_repack or pg_squeeze, may help with that.
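For illustration, a one-time physical reordering by the index from the question might look like this. Note that CLUSTER takes an ACCESS EXCLUSIVE lock and rewrites the table, and the order is not maintained for subsequent writes:

```sql
-- rewrite the table in index order (blocks all access while running)
CLUSTER playground USING playground_val_index;

-- refresh planner statistics, including the physical correlation of val
ANALYZE playground;
```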
And it's recommended to run the latest available minor release for whatever major version is in use. At the time of writing that would be 12.2 (released 2020-02-13).
Upvotes: 1