Optimize query with condition on multiple tables

Question

I have two postgres table

Table A

id	owner_id
1	100
2	101

Table B

id	a_id	user_id
1	1	200
2	1	201
3	2	202
4	2	201

id on both tables are PK and integer

I have B-Tree index on a.owner_id , b.a_id , b.user_id

First Query

In the following query

SELECT b.id
FROM b JOIN a ON b.a_id = a.id

WHERE b.user_id = 201
   OR a.owner_id = 100
LIMIT 50;

I have WHERE b.user_id = 201 OR a.owner_id = 100 condition , index for b.user_id is used by query plan but index for a.owner_id is not used, Here is query plan

QUERY PLAN
Limit  (cost=19.54..4445.84 rows=50 width=4) (actual time=0.125..5.031 rows=50 loops=1)
  Buffers: shared hit=1054
  ->  Merge Join  (cost=19.54..9815083.22 rows=110872 width=4) (actual time=0.123..5.018 rows=50 loops=1)
        Merge Cond: (a.id = b.a_id)
        Join Filter: ((b.user_id = 201) OR (a.owner_id = 100))
        Rows Removed by Join Filter: 5547
        Buffers: shared hit=1054
        ->  Index Scan using a_pkey on a  (cost=0.42..103568.63 rows=100009 width=20) (actual time=0.011..0.037 rows=50 loops=1)
              Buffers: shared hit=10
        ->  Index Scan using b_a_id on b  (cost=0.43..9515274.99 rows=11200116 width=24) (actual time=0.009..3.136 rows=5597 loops=1)
              Buffers: shared hit=1044
Planning Time: 0.626 ms
Execution Time: 5.082 ms

The query is a little slow, How can I make it faster?

Second Query

Also have another slower query

SELECT b.id
FROM b JOIN a ON b.a_id = a.id

WHERE (b.user_id = 201 AND a.owner_id = 100)
   OR (b.user_id = 100 AND a.owner_id = 201)
LIMIT 50;

QUERY PLAN
Limit  (cost=1000.43..19742.38 rows=50 width=4) (actual time=0.705..63.142 rows=50 loops=1)
  Buffers: shared hit=1419 read=3994
  ->  Gather  (cost=1000.43..75593.36 rows=199 width=4) (actual time=0.704..63.124 rows=50 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        Buffers: shared hit=1419 read=3994
        ->  Nested Loop  (cost=0.43..74573.46 rows=83 width=4) (actual time=0.752..13.122 rows=17 loops=3)
              Buffers: shared hit=1419 read=3994
              ->  Parallel Seq Scan on a  (cost=0.00..25628.06 rows=83 width=20) (actual time=0.669..11.868 rows=17 loops=3)
                    Filter: ((owner_id = 100) OR (owner_id = 201))
                    Rows Removed by Filter: 16985
                    Buffers: shared hit=258 read=3994
              ->  Index Scan using b_a_id on b  (cost=0.43..589.69 rows=1 width=24) (actual time=0.023..0.070 rows=1 loops=52)
                    Index Cond: (a_id = a.id)
                    Filter: (((user_id = 201) OR (user_id = 100)) AND (((user_id = 201) AND (a.owner_id = 100)) OR ((a.owner_id = 201) AND (user_id = 100))))
                    Rows Removed by Filter: 105
                    Buffers: shared hit=1161
Planning Time: 0.638 ms
Execution Time: 63.202 ms

bobflux · Accepted Answer

Create test data...

CREATE UNLOGGED TABLE a AS SELECT a_id, (random()*100000)::INTEGER owner_id
FROM generate_series(1,1000000) a_id;
CREATE UNLOGGED TABLE b AS SELECT b_id, (random()*100000)::INTEGER a_id, (random()*100000)::INTEGER user_id
FROM generate_series(1,10000000) b_id;
CREATE INDEX a_o ON a(owner_id);
CREATE INDEX b_a ON b(a_id);
CREATE INDEX b_u ON b(user_id);
ALTER TABLE a ADD PRIMARY KEY(a_id);
ALTER TABLE b ADD PRIMARY KEY(b_id);
VACUUM ANALYZE a,b;

Problem with first query is postgres doesn't know how to optimize star join, so we have to give it a little help.

WITH ids AS (
  SELECT a_id FROM b WHERE user_id=201
  UNION SELECT a_id FROM a WHERE owner_id=100
)
SELECT * FROM ids JOIN b USING (a_id) LIMIT 50;

This gives a plan using both indices, it may be faster in your case, or maybe not.

 Limit  (cost=455.41..634.97 rows=50 width=12) (actual time=0.494..0.642 rows=50 loops=1)
   ->  Nested Loop  (cost=455.41..41596.19 rows=11456 width=12) (actual time=0.492..0.629 rows=50 loops=1)
         ->  HashAggregate  (cost=450.19..451.32 rows=113 width=4) (actual time=0.425..0.427 rows=1 loops=1)
               Group Key: b_1.a_id
               Batches: 1  Memory Usage: 24kB
               ->  Append  (cost=5.23..449.91 rows=113 width=4) (actual time=0.076..0.358 rows=98 loops=1)
                     ->  Bitmap Heap Scan on b b_1  (cost=5.23..401.21 rows=102 width=4) (actual time=0.075..0.299 rows=92 loops=1)
                           Recheck Cond: (user_id = 201)
                           Heap Blocks: exact=92
                           ->  Bitmap Index Scan on b_u  (cost=0.00..5.20 rows=102 width=0) (actual time=0.035..0.035 rows=92 loops=1)
                                 Index Cond: (user_id = 201)
                     ->  Bitmap Heap Scan on a  (cost=4.51..47.00 rows=11 width=4) (actual time=0.019..0.033 rows=6 loops=1)
                           Recheck Cond: (owner_id = 100)
                           Heap Blocks: exact=6
                           ->  Bitmap Index Scan on a_o  (cost=0.00..4.51 rows=11 width=0) (actual time=0.014..0.014 rows=6 loops=1)
                                 Index Cond: (owner_id = 100)
         ->  Bitmap Heap Scan on b  (cost=5.22..363.09 rows=101 width=12) (actual time=0.059..0.174 rows=50 loops=1)
               Recheck Cond: (a_id = b_1.a_id)
               Heap Blocks: exact=50
               ->  Bitmap Index Scan on b_a  (cost=0.00..5.19 rows=101 width=0) (actual time=0.023..0.023 rows=104 loops=1)
                     Index Cond: (a_id = b_1.a_id)
 Planning Time: 0.448 ms
 Execution Time: 0.747 ms

As for the other query, I had to run this:

select owner_id, user_id, count(*) from a join b using (a_id) group by owner_id,user_id order by count(*) desc limit 100;

to get some user_id,owner_id that would actually return results from my test data. Then,

EXPLAIN ANALYZE
SELECT b.*
FROM b JOIN a USING (a_id)
WHERE (b.user_id = 99238 AND a.owner_id = 58599)
   OR (b.user_id = 36859 AND a.owner_id = 99027)
LIMIT 50;

Limit  (cost=24.97..532.32 rows=1 width=12) (actual time=0.274..0.982 rows=6 loops=1)
   ->  Nested Loop  (cost=24.97..532.32 rows=1 width=12) (actual time=0.271..0.976 rows=6 loops=1)
         ->  Bitmap Heap Scan on a  (cost=9.03..92.70 rows=22 width=8) (actual time=0.108..0.216 rows=12 loops=1)
               Recheck Cond: ((owner_id = 58599) OR (owner_id = 99027))
               Heap Blocks: exact=12
               ->  BitmapOr  (cost=9.03..9.03 rows=22 width=0) (actual time=0.086..0.088 rows=0 loops=1)
                     ->  Bitmap Index Scan on a_o  (cost=0.00..4.51 rows=11 width=0) (actual time=0.064..0.065 rows=3 loops=1)
                           Index Cond: (owner_id = 58599)
                     ->  Bitmap Index Scan on a_o  (cost=0.00..4.51 rows=11 width=0) (actual time=0.020..0.020 rows=9 loops=1)
                           Index Cond: (owner_id = 99027)
         ->  Bitmap Heap Scan on b  (cost=15.95..19.97 rows=1 width=12) (actual time=0.058..0.060 rows=0 loops=12)
               Recheck Cond: ((a_id = a.a_id) AND ((user_id = 99238) OR (user_id = 36859)))
               Filter: (((user_id = 99238) AND (a.owner_id = 58599)) OR ((user_id = 36859) AND (a.owner_id = 99027)))
               Heap Blocks: exact=6
               ->  BitmapAnd  (cost=15.95..15.95 rows=1 width=0) (actual time=0.053..0.053 rows=0 loops=12)
                     ->  Bitmap Index Scan on b_a  (cost=0.00..5.19 rows=101 width=0) (actual time=0.015..0.015 rows=50 loops=12)
                           Index Cond: (a_id = a.a_id)
                     ->  BitmapOr  (cost=10.50..10.50 rows=205 width=0) (actual time=0.046..0.046 rows=0 loops=6)
                           ->  Bitmap Index Scan on b_u  (cost=0.00..5.20 rows=102 width=0) (actual time=0.021..0.021 rows=121 loops=6)
                                 Index Cond: (user_id = 99238)
                           ->  Bitmap Index Scan on b_u  (cost=0.00..5.20 rows=102 width=0) (actual time=0.024..0.024 rows=105 loops=6)
                                 Index Cond: (user_id = 36859)
 Planning Time: 0.703 ms
 Execution Time: 1.063 ms

It doesn't use a seq scan as yours do, so maybe you have an old version that couldn't optimize this properly? It's quite weird that it picks a seq scan for table a when the row count estimates are pretty accurate. You should investigate, maybe try

SELECT * FROM a WHERE a.owner_id = 58599 OR a.owner_id = 99027
LIMIT 50;

this should give an index or bitmap index scan, if it does a seq scan, then you have a small test case to find out why. Anyway, you can still force use of indices with:

EXPLAIN ANALYZE
WITH ids AS (
  SELECT a_id FROM b WHERE user_id IN (99238,36859)
  UNION SELECT a_id FROM a WHERE owner_id IN (58599,99027)
)
SELECT * FROM ids JOIN b USING (a_id) JOIN a USING (a_id)
    WHERE (b.user_id = 99238 AND a.owner_id = 58599)
       OR (b.user_id = 36859 AND a.owner_id = 99027);

...but it's pretty damn ugly. Or you could do each clause in your OR separately and do the AND manyally with this one, which is also ugly:

EXPLAIN ANALYZE
SELECT a_id FROM b WHERE b.user_id = 99238 
INTERSECT
SELECT a_id FROM a WHERE a.owner_id = 58599
LIMIT 50;

How can I optimize large OFFSETs

You don't, in fact when large offsets are used it usually hints that you're doing it wrong,by repeatedly doing the same query, for example for pagination, and displaying chunks of results. There are two solutions. If results will be fetched quickly enough that a transaction can stay open while you do that, open a cursor for the query without LIMIT or OFFSET and use FETCH to get results in chunks. Otherwise, do the query once without LIMIT, store the results in a cache, and paginate from that without redoing the query.

Optimize query with condition on multiple tables

Table A

Table B

First Query

Second Query

Answers (2)

Related Questions