Update with subquery vs Update with join - Which is better in performance

Question

In a Postgres database, are the two following queries equivalent in performance? Everyone says "joins are always faster than subqueries", but does the Postgres query planner optimize subqueries into joins behind the scenes?

Query 1:

UPDATE table_a
SET col_1 = 'a fixed value'
WHERE col_2 IN (
    SELECT col_2 FROM table_b
);

Explain Plan:

Update on table_a  (cost=0.00..9316.10 rows=1 width=827)
  ->  Nested Loop Semi Join  (cost=0.00..9316.10 rows=1 width=827)
        ->  Seq Scan on table_a  (cost=0.00..9287.20 rows=1 width=821)
        ->  Index Scan using idx_table_b on table_b  (cost=0.00..14.45 rows=1 width=14)
              Index Cond: (col_2 = (table_a.col_2)::numeric)

Query 2:

UPDATE table_a ta
SET col_1 = 'a fixed value'
FROM table_b tb
WHERE ta.col_2 = tb.col_2;

Explain Plan:

Update on table_a ta  (cost=0.00..9301.67 rows=1 width=827)
  ->  Nested Loop  (cost=0.00..9301.67 rows=1 width=827)
        ->  Seq Scan on table_a ta  (cost=0.00..9287.20 rows=1 width=821)
        ->  Index Scan using idx_table_b on table_b tb  (cost=0.00..14.45 rows=1 width=14)
              Index Cond: (col_2 = (ta.col_2)::numeric)

I believe they are equivalent in results (Please alert me if I'm wrong). I have tried several explain plans with large sets of data. It seems they are equivalent in performance, both when updating full table and when restricting table_a.col_2 to a small subset.

I want to be sure I'm not missing something. If they are equivalent, which would you choose and why?

Craig Ringer · Accepted Answer

does the Postgres query planner optimize subqueries into joins behind the scenes?

Usually, yes.

Don't guess, look at the query plans.

Given:

CREATE TABLE table_a (col_1 text, col_2 integer );
CREATE TABLE table_b (col_2 integer);
INSERT INTO table_b(col_2) VALUES (1),(2),(4),(NULL);
INSERT INTO table_a (col_1, col_2) VALUES ('a fixed value', 2), ('a fixed value', NULL), ('some other value', 2);
ANALYZE table_a;
ANALYZE table_b;

compare:

regress=> explain UPDATE table_a
SET col_1 = 'a fixed value'
WHERE col_2 IN (
    SELECT col_2 FROM table_b
);
                                QUERY PLAN                                
--------------------------------------------------------------------------
 Update on table_a  (cost=1.09..2.15 rows=2 width=16)
   ->  Hash Semi Join  (cost=1.09..2.15 rows=2 width=16)
         Hash Cond: (table_a.col_2 = table_b.col_2)
         ->  Seq Scan on table_a  (cost=0.00..1.03 rows=3 width=10)
         ->  Hash  (cost=1.04..1.04 rows=4 width=10)
               ->  Seq Scan on table_b  (cost=0.00..1.04 rows=4 width=10)
(6 rows)

regress=> explain UPDATE table_a ta
regress-> SET col_1 = 'a fixed value'
regress-> FROM table_b tb
regress-> WHERE ta.col_2 = tb.col_2;
                                 QUERY PLAN                                  
-----------------------------------------------------------------------------
 Update on table_a ta  (cost=1.07..2.14 rows=1 width=16)
   ->  Hash Join  (cost=1.07..2.14 rows=1 width=16)
         Hash Cond: (tb.col_2 = ta.col_2)
         ->  Seq Scan on table_b tb  (cost=0.00..1.04 rows=4 width=10)
         ->  Hash  (cost=1.03..1.03 rows=3 width=10)
               ->  Seq Scan on table_a ta  (cost=0.00..1.03 rows=3 width=10)
(6 rows)

See? Same plan. The subquery has been transformed into a join.

It's usually cleaner to express it with EXISTS rather than IN. This is much more important for NOT IN vs NOT EXISTS, where they're semantically different in the face of nulls, so it's a good habit anyway. You'd write:

UPDATE table_a a
SET col_1 = 'a fixed value'
WHERE EXISTS (SELECT 1 FROM table_b b WHERE b.col_2 = a.col_2);

This will also tend to produce the same plan, but it's a bit nicer IMO - not least because if it doesn't plan down to a join, a correlated subquery is generally less horrible than a giant IN list scan.

Update with subquery vs Update with join - Which is better in performance

Answers (2)

Related Questions