Reputation: 22803
I have two queries that are functionally identical. One of them performs very well, the other one performs very poorly. I do not see from where the performance difference arises.
Query #1:
SELECT id
FROM subsource_position
WHERE
id NOT IN (SELECT position_id FROM subsource)
This comes back with the following plan:
QUERY PLAN
-------------------------------------------------------------------------------
Seq Scan on subsource_position (cost=0.00..362486535.10 rows=128524 width=4)
Filter: (NOT (SubPlan 1))
SubPlan 1
-> Materialize (cost=0.00..2566.50 rows=101500 width=4)
-> Seq Scan on subsource (cost=0.00..1662.00 rows=101500 width=4)
Query #2:
SELECT id FROM subsource_position
EXCEPT
SELECT position_id FROM subsource;
Plan:
QUERY PLAN
-------------------------------------------------------------------------------------------------
SetOp Except (cost=24760.35..25668.66 rows=95997 width=4)
-> Sort (cost=24760.35..25214.50 rows=181663 width=4)
Sort Key: "*SELECT* 1".id
-> Append (cost=0.00..6406.26 rows=181663 width=4)
-> Subquery Scan on "*SELECT* 1" (cost=0.00..4146.94 rows=95997 width=4)
-> Seq Scan on subsource_position (cost=0.00..3186.97 rows=95997 width=4)
-> Subquery Scan on "*SELECT* 2" (cost=0.00..2259.32 rows=85666 width=4)
-> Seq Scan on subsource (cost=0.00..1402.66 rows=85666 width=4)
(8 rows)
I have a feeling I'm missing either something obviously bad about one of my queries, or I have misconfigured the PostgreSQL server. I would have expected this NOT IN
to optimize well; is NOT IN
always a performance problem or is there a reason it does not optimize here?
Additional data:
=> select count(*) from subsource;
count
-------
85158
(1 row)
=> select count(*) from subsource_position;
count
-------
93261
(1 row)
Edit: I have now fixed the A-B != B-A problem mentioned below. But my problem as stated still exists: query #1 is still massively worse than query #2. This, I believe, follows from the fact that both tables have similar numbers of rows.
Edit 2: I'm using PostgresQL 9.0.4. I cannot use EXPLAIN ANALYZE because query #1 takes too long. All of these columns are NOT NULL, so there should be no difference as a result of that.
Edit 3: I have an index on both these columns. I haven't yet gotten query #1 to complete (gave up after ~10 minutes). Query #2 returns immediately.
Upvotes: 34
Views: 24257
Reputation: 1497
Query #1 is not the elegant way for doing this... (NOT) IN SELECT is fine for a few entries, but it can't use indexes (Seq Scan
).
Not having EXCEPT, the alternative is to use a JOIN (HASH JOIN
):
SELECT sp.id
FROM subsource_position AS sp
LEFT JOIN subsource AS s ON (s.position_id = sp.id)
WHERE
s.position_id IS NULL
EXCEPT appeared in Postgres long time ago... But using MySQL I believe this is still the only way, using indexes, to achieve this.
Upvotes: 27
Reputation: 434675
Your queries are not functionally equivalent so any comparison of their query plans is meaningless.
Your first query is, in set theory terms, this:
{subsource.position_id} - {subsource_position.id}
^ ^ ^ ^
but your second is this:
{subsource_position.id} - {subsource.position_id}
^ ^ ^ ^
And A - B
is not the same as B - A
for arbitrary sets A
and B
.
Fix your queries to be semantically equivalent and try again.
Upvotes: 7
Reputation: 42152
If id
and position_id
are both indexed (either on their own or first column in a multi-column index), then two index scans are all that are necessary - it's a trivial sorted-merge based set algorithm.
Personally I think PostgreSQL simply doesn't have the optimization intelligence to understand this.
(I came to this question after diagnosing a query running for over 24 hours that I could perform with sort x y y | uniq -u
on the command line in seconds. Database less than 50MB when exported with pg_dump.)
PS: more interesting comment here:
more work has been put into optimizing EXCEPT and NOT EXISTS than NOT IN, because the latter is substantially less useful due to its unintuitive but spec-mandated handling of NULLs. We're not going to apologize for that, and we're not going to regard it as a bug.
What it comes down to is that except
is different to not in
with respect to null handling. I haven't looked up the details, but it means PostgreSQL (aggressively) doesn't optimize it.
Upvotes: 4
Reputation: 1870
The second query makes usage of the HASH JOIN
feature of postgresql. This is much faster then the Seq Scan
of the first one.
Upvotes: 2
Reputation: 25098
Since you are running with the default configuration, try bumping up work_mem. Most likely, the subquery ends up getting spooled to disk because you only allow for 1Mb of work memory. Try 10 or 20mb.
Upvotes: 7