Reputation: 9337
Using Postgres v. 8.4.2.
I have a standard three-table setup for a many-to-many relationship. A query against two of these tables is supposed to return a single record that has no existing mapping:
SELECT b.id
FROM scm_branch b
LEFT OUTER JOIN "scm_branchgroup_branches" bgb
ON b.id = bgb.branch_id
WHERE
bgb.branch_id is NULL
LIMIT 1
(scm_branchgroup_branches is the junction table holding the mapping records.)
This query has the following top-level cost:
Limit (cost=0.00..0.22 rows=1 width=4)
However, if I change the field in the WHERE condition to the primary key of the bgb table, the cost goes up drastically:
SELECT b.id
FROM scm_branch b
LEFT OUTER JOIN "scm_branchgroup_branches" bgb
ON b.id = bgb.branch_id
WHERE
bgb.id IS NULL
LIMIT 1
(Note: this query uses "bgb.id IS NULL" instead of the "bgb.branch_id IS NULL" used in the previous example.)
Top level cost:
Limit (cost=236366.74..4644900.75 rows=1 width=4)
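For comparison, both plans can be inspected in full with EXPLAIN (a sketch using the query from above; EXPLAIN ANALYZE additionally runs the query and reports actual times):

```sql
-- Show the full plan the optimizer chose for the expensive variant.
EXPLAIN ANALYZE
SELECT b.id
FROM scm_branch b
LEFT OUTER JOIN scm_branchgroup_branches bgb
  ON b.id = bgb.branch_id
WHERE bgb.id IS NULL
LIMIT 1;
```

Running the same statement with the bgb.branch_id IS NULL predicate makes the difference in join strategy visible line by line, not just in the top-level cost.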
Why such a drastic difference? I see the difference in the execution plans, but I don't understand the underlying reasons for it.
I'll be happy to provide any additional info needed to answer this question.
Thanks D.
Upvotes: 1
Views: 424
Reputation: 183321
In the version with branch_id IS NULL, PostgreSQL recognizes that you're doing an anti-join, and it optimizes accordingly, using a hash-join-style algorithm. (Google "PostgreSQL hash left anti-join" for lots of information on this optimization.) In the version with id IS NULL, however, it doesn't recognize this fact, and its optimizations are not nearly as helpful. What's more, the join essentially forces it to use the index on branch_id, so it can't take advantage of the primary-key index on id, and it must consult the actual table data to determine whether a given record satisfies the WHERE clause.
(There may well be other factors at play as well — I'm not an expert on PostgreSQL — but I believe those are the main factors.)
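If the intent is simply "find a branch with no mapping row", a NOT EXISTS formulation states the anti-join explicitly, so the planner can recognize it without depending on which bgb column is tested for NULL. A sketch, using the table and column names from the question (PostgreSQL 8.4 plans NOT EXISTS as an anti-join):

```sql
-- Explicit anti-join: return one branch that has no mapping row.
SELECT b.id
FROM scm_branch b
WHERE NOT EXISTS (
    SELECT 1
    FROM scm_branchgroup_branches bgb
    WHERE bgb.branch_id = b.id
)
LIMIT 1;
```

This is equivalent to the branch_id IS NULL version of the query, and it sidesteps the question of which column's NULL-ness the optimizer can reason about.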
Upvotes: 1