Reputation: 207

When does PostgreSQL collapse subqueries to joins and when not?

Considering the following query:

select a.id from a
where
    a.id in (select b.a_id from b where b.x='x1' and b.y='y1') and
    a.id in (select b.a_id from b where b.x='x2' and b.y='y2')
order by a.date desc
limit 20

Which should be rewritable to that faster one:

select a.id from a inner join b as b1 on (a.id=b1.a_id) inner join b as b2 on (a.id=b2.a_id)
where
    b1.x='x1' and b1.y='y1' and
    b2.x='x2' and b2.y='y2'
order by a.date desc
limit 20

We would prefer not to rewrite our queries by changing our source code as it complicates a lot (especially when using Django).

Thus, we wonder when PostgreSQL collapses subqueries to joins and when not?

That is the simplified data model:

                                      Table "public.a"
      Column       |          Type          |                          Modifiers
-------------------+------------------------+-------------------------------------------------------------
 id                | integer                | not null default nextval('a_id_seq'::regclass)
 date              | date                   | 
 content           | character varying(256) | 
Indexes:
    "a_pkey" PRIMARY KEY, btree (id)
    "a_id_date" btree (id, date)
Referenced by:
    TABLE "b" CONSTRAINT "a_id_refs_id_6e634433343d4435353" FOREIGN KEY (a_id) REFERENCES a(id) DEFERRABLE INITIALLY DEFERRED


       Table "public.b"
  Column  |   Type    | Modifiers 
----------+-----------+-----------
 a_id     | integer   | not null
 x        | text      | not null
 y        | text      | not null
Indexes:
    "b_x_y_a_id" UNIQUE CONSTRAINT, btree (x, y, a_id)
Foreign-key constraints:
    "a_id_refs_id_6e634433343d4435353" FOREIGN KEY (a_id) REFERENCES a(id) DEFERRABLE INITIALLY DEFERRED

a has 7 million rows
b has 70 million rows
cardinality of b.x = ~100
cardinality of b.y = ~100000
cardinality of b.x, b.y = ~150000
imagine tables c, d and e that have the same structure as b and could be used additionally to further reduce the resulting a.ids

Versions of PostgreSQL, we tested the queries.

PostgreSQL 9.2.7 on x86_64-suse-linux-gnu, compiled by gcc (SUSE Linux) 4.7.2 20130108 [gcc-4_7-branch revision 195012], 64-bit
PostgreSQL 9.4beta1 on x86_64-suse-linux-gnu, compiled by gcc (SUSE Linux) 4.7.2 20130108 [gcc-4_7-branch revision 195012], 64-bit

Query Plans (with empty file cache and mem cache):

Upvotes: 3

Answers (2)

wildplasser

Reputation: 44250

        -- The table definitions
CREATE TABLE table_a (
        id     SERIAL NOT NULL PRIMARY KEY
        , d      DATE NOT NULL
        );

CREATE TABLE table_b (
        id     SERIAL NOT NULL PRIMARY KEY
        , a_id INTEGER NOT NULL REFERENCES table_a(id)
        , x VARCHAR NOT NULL
        , y VARCHAR NOT NULL
        );
        -- fake some data
INSERT INTO table_a(d)
SELECT gs
FROM generate_series( '1904-01-01'::timestamp ,'2015-01-01'::timestamp, '1 day'::interval) gs;
INSERT INTO table_b(a_id, x, y) SELECT a.id, 'x1' , 'y1' FROM table_a a;
INSERT INTO table_b(a_id, x, y) SELECT a.id, 'x2' , 'y2' FROM table_a a;
INSERT INTO table_b(a_id, x, y) SELECT a.id, 'x3' , 'y3' FROM table_a a;
DELETE FROM table_b WHERE RANDOM() > 0.3;

CREATE UNIQUE INDEX ON table_a(d, id);  -- date first
CREATE INDEX ON table_b(a_id);          -- supporting the FK

        -- For initialising the statistics
VACUUM ANALYZE table_a;
VACUUM ANALYZE table_b;

        -- original query
EXPLAIN ANALYZE
SELECT a.id
FROM table_a a
WHERE a.id IN (SELECT b.a_id FROM table_b b WHERE b.x='x1' AND b.y='y1')
  AND a.id IN (SELECT b.a_id FROM table_b b WHERE b.x='x2' AND b.y='y2')
order by a.d desc
limit 20;

        -- EXISTS() version
EXPLAIN ANALYZE
SELECT a.id
FROM table_a a
WHERE EXISTS (SELECT * FROM table_b b WHERE b.a_id= a.id AND b.x='x1' AND b.y='y1')
  AND EXISTS (SELECT * FROM table_b b WHERE b.a_id= a.id AND b.x='x2' AND b.y='y2')
order by a.d desc
limit 20;

Resulting query plan:

 Limit  (cost=0.87..491.23 rows=20 width=8) (actual time=0.080..0.521 rows=20 loops=1)
   ->  Nested Loop Semi Join  (cost=0.87..15741.40 rows=642 width=8) (actual time=0.080..0.518 rows=20 loops=1)
         ->  Nested Loop Semi Join  (cost=0.58..14380.54 rows=4043 width=12) (actual time=0.017..0.391 rows=74 loops=1)
               ->  Index Only Scan Backward using table_a_d_id_idx on table_a a  (cost=0.29..732.75 rows=40544 width=8) (actual time=0.008..0.048 rows=231 loops=1)
                     Heap Fetches: 0
               ->  Index Scan using table_b_a_id_idx on table_b b_1  (cost=0.29..0.34 rows=1 width=4) (actual time=0.001..0.001 rows=0 loops=231)
                     Index Cond: (a_id = a.id)
                     Filter: (((x)::text = 'x2'::text) AND ((y)::text = 'y2'::text))
                     Rows Removed by Filter: 0
         ->  Index Scan using table_b_a_id_idx on table_b b  (cost=0.29..0.34 rows=1 width=4) (actual time=0.001..0.001 rows=0 loops=74)
               Index Cond: (a_id = a.id)
               Filter: (((x)::text = 'x1'::text) AND ((y)::text = 'y1'::text))
               Rows Removed by Filter: 1
 Total runtime: 0.547 ms

The two queries cause exactly the same query plan and results (because of the NOT NULL on tableb.a_id )
The index table_b(a_id) is absolutely necessary once you prefer index joins over hashjoins (with 7M//70M tuples, I think you should prefer index scans)
The sort (expensive) in the outer query has been avoided (using the index table_a(d, id) )

Upvotes: 0

Denis de Bernardy

Reputation: 78561

Your last comment nails the reason, I think: The two queries are not equivalent unless a unique constraint kicks in to make them equivalent.

Example of an equivalent schema:

denis=# \d a
                         Table "public.a"
 Column |  Type   |                   Modifiers                    
--------+---------+------------------------------------------------
 id     | integer | not null default nextval('a_id_seq'::regclass)
 d      | date    | not null
Indexes:
    "a_pkey" PRIMARY KEY, btree (id)
Referenced by:
    TABLE "b" CONSTRAINT "b_a_id_fkey" FOREIGN KEY (a_id) REFERENCES a(id)

denis=# \d b
       Table "public.b"
 Column |  Type   | Modifiers 
--------+---------+-----------
 a_id   | integer | not null
 val    | integer | not null
Foreign-key constraints:
    "b_a_id_fkey" FOREIGN KEY (a_id) REFERENCES a(id)

Equivalent offending data using that schema:

denis=# select * from a order by d;
 id |     d      
----+------------
  1 | 2014-12-10
  2 | 2014-12-11
  3 | 2014-12-12
  4 | 2014-12-13
  5 | 2014-12-14
  6 | 2014-12-15
(6 rows)

denis=# select * from b order by a_id, val;
 a_id | val 
------+-----
    1 |   1
    1 |   1
    2 |   1
    2 |   1
    2 |   2
    3 |   1
    3 |   1
    3 |   2
(8 rows)

Rows using two IN clauses:

denis=# select a.id, a.d from a where a.id in (select b.a_id from b where b.val = 1) and a.id in (select b.a_id from b where b.val = 2) order by d;
 id |     d      
----+------------
  2 | 2014-12-11
  3 | 2014-12-12
(2 rows)

Rows using two joins:

denis=# select a.id, a.d from a join b b1 on a.id = b1.a_id join b b2 on a.id = b2.a_id where b1.val = 1 and b2.val = 2 order by d;
 id |     d      
----+------------
  2 | 2014-12-11
  2 | 2014-12-11
  3 | 2014-12-12
  3 | 2014-12-12
(4 rows)

I see you've a unique constraint on b (a_id, x, y) already, though. Perhaps highlight the issue to the Postgres performance list to get the reason why it's not collapsed in your particular case -- or at least not generating the exact same plan.

Upvotes: 1

When does PostgreSQL collapse subqueries to joins and when not?

Answers (2)

Related Questions