Reputation: 10086
In a Postgres database, are the two following queries equivalent in performance? Everyone says "joins are always faster than subqueries", but does the Postgres query planner optimize subqueries into joins behind the scenes?
Query 1:
UPDATE table_a
SET col_1 = 'a fixed value'
WHERE col_2 IN (
SELECT col_2 FROM table_b
);
Explain Plan:
Update on table_a (cost=0.00..9316.10 rows=1 width=827) -> Nested Loop Semi Join (cost=0.00..9316.10 rows=1 width=827) -> Seq Scan on table_a (cost=0.00..9287.20 rows=1 width=821) -> Index Scan using idx_table_b on table_b (cost=0.00..14.45 rows=1 width=14) Index Cond: (col_2 = (table_a.col_2)::numeric)
Query 2:
UPDATE table_a ta
SET col_1 = 'a fixed value'
FROM table_b tb
WHERE ta.col_2 = tb.col_2;
Explain Plan:
Update on table_a ta (cost=0.00..9301.67 rows=1 width=827) -> Nested Loop (cost=0.00..9301.67 rows=1 width=827) -> Seq Scan on table_a ta (cost=0.00..9287.20 rows=1 width=821) -> Index Scan using idx_table_b on table_b tb (cost=0.00..14.45 rows=1 width=14) Index Cond: (col_2 = (ta.col_2)::numeric)
I believe they are equivalent in results (Please alert me if I'm wrong). I have tried several explain plans with large sets of data. It seems they are equivalent in performance, both when updating full table and when restricting table_a.col_2
to a small subset.
I want to be sure I'm not missing something. If they are equivalent, which would you choose and why?
Upvotes: 2
Views: 6547
Reputation: 324455
does the Postgres query planner optimize subqueries into joins behind the scenes?
Usually, yes.
Don't guess, look at the query plans.
Given:
CREATE TABLE table_a (col_1 text, col_2 integer ); CREATE TABLE table_b (col_2 integer); INSERT INTO table_b(col_2) VALUES (1),(2),(4),(NULL); INSERT INTO table_a (col_1, col_2) VALUES ('a fixed value', 2), ('a fixed value', NULL), ('some other value', 2); ANALYZE table_a; ANALYZE table_b;
compare:
regress=> explain UPDATE table_a SET col_1 = 'a fixed value' WHERE col_2 IN ( SELECT col_2 FROM table_b ); QUERY PLAN -------------------------------------------------------------------------- Update on table_a (cost=1.09..2.15 rows=2 width=16) -> Hash Semi Join (cost=1.09..2.15 rows=2 width=16) Hash Cond: (table_a.col_2 = table_b.col_2) -> Seq Scan on table_a (cost=0.00..1.03 rows=3 width=10) -> Hash (cost=1.04..1.04 rows=4 width=10) -> Seq Scan on table_b (cost=0.00..1.04 rows=4 width=10) (6 rows) regress=> explain UPDATE table_a ta regress-> SET col_1 = 'a fixed value' regress-> FROM table_b tb regress-> WHERE ta.col_2 = tb.col_2; QUERY PLAN ----------------------------------------------------------------------------- Update on table_a ta (cost=1.07..2.14 rows=1 width=16) -> Hash Join (cost=1.07..2.14 rows=1 width=16) Hash Cond: (tb.col_2 = ta.col_2) -> Seq Scan on table_b tb (cost=0.00..1.04 rows=4 width=10) -> Hash (cost=1.03..1.03 rows=3 width=10) -> Seq Scan on table_a ta (cost=0.00..1.03 rows=3 width=10) (6 rows)
See? Same plan. The subquery has been transformed into a join.
It's usually cleaner to express it with EXISTS
rather than IN
. This is much more important for NOT IN
vs NOT EXISTS
, where they're semantically different in the face of nulls, so it's a good habit anyway. You'd write:
UPDATE table_a a
SET col_1 = 'a fixed value'
WHERE EXISTS (SELECT 1 FROM table_b b WHERE b.col_2 = a.col_2);
This will also tend to produce the same plan, but it's a bit nicer IMO - not least because if it doesn't plan down to a join, a correlated subquery is generally less horrible than a giant IN
list scan.
Upvotes: 3
Reputation: 15387
IN:
Returns true if a specified value matches any value in a subquery or a list.
Exists:
Returns true if a subquery contains any rows.
Join:
Joins 2 resultsets on the joining column.
If the joining column is UNIQUE
then join
in faster .
If it's not, then IN is faster than JOIN on DISTINCT
.
See this article in my blog for performance details for Microsoft SQL Server, which might also be relevant to PostgreSQL:
Upvotes: 2