Xenon
Xenon

Reputation: 304

Why is NOT IN much slower than IN in SQL query

I have found a surprising (at least to me) thing with IN and NOT IN. When I try to explain first query on PostgreSQL database:

EXPLAIN DELETE
FROM AuditTaskImpl l
WHERE  l.processInstanceId in (select spl.processInstanceId
                               FROM ProcessInstanceLog spl
                               WHERE spl.status not in ( 2, 3))

it tells me this:

Delete on audittaskimpl l  (cost=2794.48..6373.52 rows=50859 width=12)
  ->  Hash Semi Join  (cost=2794.48..6373.52 rows=50859 width=12)
        Hash Cond: (l.processinstanceid = spl.processinstanceid)
        ->  Seq Scan on audittaskimpl l  (cost=0.00..2005.59 rows=50859 width=14)
        ->  Hash  (cost=1909.24..1909.24 rows=50899 width=14)
              ->  Seq Scan on processinstancelog spl  (cost=0.00..1909.24 rows=50899 width=14)
                    Filter: (status <> ALL ('{2,3}'::integer[]))

However, when I change in for not in, which is just a negation:

EXPLAIN DELETE
FROM AuditTaskImpl l
WHERE  l.processInstanceId NOT in (select spl.processInstanceId
                               FROM ProcessInstanceLog spl
                               WHERE spl.status not in ( 2, 3))

it tells me this:

Delete on audittaskimpl l  (cost=0.00..63321079.15 rows=25430 width=6)
  ->  Seq Scan on audittaskimpl l  (cost=0.00..63321079.15 rows=25430 width=6)
        Filter: (NOT (SubPlan 1))
        SubPlan 1
          ->  Materialize  (cost=0.00..2362.73 rows=50899 width=8)
                ->  Seq Scan on processinstancelog spl  (cost=0.00..1909.24 rows=50899 width=8)
                      Filter: (status <> ALL ('{2,3}'::integer[]))

As you can see, with IN it uses hash join, which is of course much quicker, but with NOT IN it uses just plain sequential scan row by row. But since NOT IN is just a negation it could use hash join again and just do the opposite: with IN when there is processInstanceId in nested select, add it to the result, when there is not, don't add it, with NOT IN when there is processInstanceId in nested select, don't add it to the result, when there is not, add it to the result.

So can you perhaps explain why this happens? To clarify AuditTaskImpl has processInstanceId attribute which is also present in ProcessInstanceLog table, although there is no foreign key relationship between them.

Thanks.

Upvotes: 1

Views: 492

Answers (1)

Gordon Linoff
Gordon Linoff

Reputation: 1269813

The semantics of NOT IN require that nothing be returned if any value in the subquery is NULL. Hence, Postgres needs to look at all the values.

I strongly recommend never using NOT IN with a subquery. Always use NOT EXISTS:

DELETE FROM AuditTaskImpl l
    WHERE NOT EXISTS (SELECT 1 
                      FROM ProcessInstanceLog spl
                      WHERE l.processInstanceId = spl.spl.processInstanceId AND
                            spl.status not in (2, 3)
                     );

Upvotes: 4

Related Questions