Daniel Lyons
Daniel Lyons

Reputation: 22803

PostgreSQL: NOT IN versus EXCEPT performance difference (edited #2)

I have two queries that are functionally identical. One of them performs very well, the other one performs very poorly. I do not see from where the performance difference arises.

Query #1:

SELECT id 
FROM subsource_position
WHERE
  id NOT IN (SELECT position_id FROM subsource)

This comes back with the following plan:

                                  QUERY PLAN                                   
-------------------------------------------------------------------------------
 Seq Scan on subsource_position  (cost=0.00..362486535.10 rows=128524 width=4)
   Filter: (NOT (SubPlan 1))
   SubPlan 1
     ->  Materialize  (cost=0.00..2566.50 rows=101500 width=4)
           ->  Seq Scan on subsource  (cost=0.00..1662.00 rows=101500 width=4)

Query #2:

SELECT id FROM subsource_position
EXCEPT
SELECT position_id FROM subsource;

Plan:

                                           QUERY PLAN                                            
-------------------------------------------------------------------------------------------------
 SetOp Except  (cost=24760.35..25668.66 rows=95997 width=4)
   ->  Sort  (cost=24760.35..25214.50 rows=181663 width=4)
         Sort Key: "*SELECT* 1".id
         ->  Append  (cost=0.00..6406.26 rows=181663 width=4)
               ->  Subquery Scan on "*SELECT* 1"  (cost=0.00..4146.94 rows=95997 width=4)
                     ->  Seq Scan on subsource_position  (cost=0.00..3186.97 rows=95997 width=4)
               ->  Subquery Scan on "*SELECT* 2"  (cost=0.00..2259.32 rows=85666 width=4)
                     ->  Seq Scan on subsource  (cost=0.00..1402.66 rows=85666 width=4)
(8 rows)

I have a feeling I'm missing either something obviously bad about one of my queries, or I have misconfigured the PostgreSQL server. I would have expected this NOT IN to optimize well; is NOT IN always a performance problem or is there a reason it does not optimize here?

Additional data:

=> select count(*) from subsource;
 count 
-------
 85158
(1 row)

=> select count(*) from subsource_position;
 count 
-------
 93261
(1 row)

Edit: I have now fixed the A-B != B-A problem mentioned below. But my problem as stated still exists: query #1 is still massively worse than query #2. This, I believe, follows from the fact that both tables have similar numbers of rows.

Edit 2: I'm using PostgresQL 9.0.4. I cannot use EXPLAIN ANALYZE because query #1 takes too long. All of these columns are NOT NULL, so there should be no difference as a result of that.

Edit 3: I have an index on both these columns. I haven't yet gotten query #1 to complete (gave up after ~10 minutes). Query #2 returns immediately.

Upvotes: 34

Views: 24257

Answers (5)

Antony Gibbs
Antony Gibbs

Reputation: 1497

Query #1 is not the elegant way for doing this... (NOT) IN SELECT is fine for a few entries, but it can't use indexes (Seq Scan).

Not having EXCEPT, the alternative is to use a JOIN (HASH JOIN):

    SELECT sp.id
    FROM subsource_position AS sp
        LEFT JOIN subsource AS s ON (s.position_id = sp.id)
    WHERE
        s.position_id IS NULL

EXCEPT appeared in Postgres long time ago... But using MySQL I believe this is still the only way, using indexes, to achieve this.

Upvotes: 27

mu is too short
mu is too short

Reputation: 434675

Your queries are not functionally equivalent so any comparison of their query plans is meaningless.

Your first query is, in set theory terms, this:

{subsource.position_id} - {subsource_position.id}
          ^        ^                ^        ^

but your second is this:

{subsource_position.id} - {subsource.position_id}
          ^        ^                ^        ^

And A - B is not the same as B - A for arbitrary sets A and B.

Fix your queries to be semantically equivalent and try again.

Upvotes: 7

Barry Kelly
Barry Kelly

Reputation: 42152

If id and position_id are both indexed (either on their own or first column in a multi-column index), then two index scans are all that are necessary - it's a trivial sorted-merge based set algorithm.

Personally I think PostgreSQL simply doesn't have the optimization intelligence to understand this.

(I came to this question after diagnosing a query running for over 24 hours that I could perform with sort x y y | uniq -u on the command line in seconds. Database less than 50MB when exported with pg_dump.)

PS: more interesting comment here:

more work has been put into optimizing EXCEPT and NOT EXISTS than NOT IN, because the latter is substantially less useful due to its unintuitive but spec-mandated handling of NULLs. We're not going to apologize for that, and we're not going to regard it as a bug.

What it comes down to is that except is different to not in with respect to null handling. I haven't looked up the details, but it means PostgreSQL (aggressively) doesn't optimize it.

Upvotes: 4

Thomas Berger
Thomas Berger

Reputation: 1870

The second query makes usage of the HASH JOIN feature of postgresql. This is much faster then the Seq Scan of the first one.

Upvotes: 2

Magnus Hagander
Magnus Hagander

Reputation: 25098

Since you are running with the default configuration, try bumping up work_mem. Most likely, the subquery ends up getting spooled to disk because you only allow for 1Mb of work memory. Try 10 or 20mb.

Upvotes: 7

Related Questions