Reputation: 304
I have found a surprising (at least to me) thing with IN and NOT IN. When I try to explain first query on PostgreSQL database:
EXPLAIN DELETE
FROM AuditTaskImpl l
WHERE l.processInstanceId in (select spl.processInstanceId
FROM ProcessInstanceLog spl
WHERE spl.status not in ( 2, 3))
it tells me this:
Delete on audittaskimpl l (cost=2794.48..6373.52 rows=50859 width=12)
-> Hash Semi Join (cost=2794.48..6373.52 rows=50859 width=12)
Hash Cond: (l.processinstanceid = spl.processinstanceid)
-> Seq Scan on audittaskimpl l (cost=0.00..2005.59 rows=50859 width=14)
-> Hash (cost=1909.24..1909.24 rows=50899 width=14)
-> Seq Scan on processinstancelog spl (cost=0.00..1909.24 rows=50899 width=14)
Filter: (status <> ALL ('{2,3}'::integer[]))
However, when I change in for not in, which is just a negation:
EXPLAIN DELETE
FROM AuditTaskImpl l
WHERE l.processInstanceId NOT in (select spl.processInstanceId
FROM ProcessInstanceLog spl
WHERE spl.status not in ( 2, 3))
it tells me this:
Delete on audittaskimpl l (cost=0.00..63321079.15 rows=25430 width=6)
-> Seq Scan on audittaskimpl l (cost=0.00..63321079.15 rows=25430 width=6)
Filter: (NOT (SubPlan 1))
SubPlan 1
-> Materialize (cost=0.00..2362.73 rows=50899 width=8)
-> Seq Scan on processinstancelog spl (cost=0.00..1909.24 rows=50899 width=8)
Filter: (status <> ALL ('{2,3}'::integer[]))
As you can see, with IN it uses hash join, which is of course much quicker, but with NOT IN it uses just plain sequential scan row by row. But since NOT IN is just a negation it could use hash join again and just do the opposite: with IN when there is processInstanceId in nested select, add it to the result, when there is not, don't add it, with NOT IN when there is processInstanceId in nested select, don't add it to the result, when there is not, add it to the result.
So can you perhaps explain why this happens? To clarify AuditTaskImpl has processInstanceId attribute which is also present in ProcessInstanceLog table, although there is no foreign key relationship between them.
Thanks.
Upvotes: 1
Views: 492
Reputation: 1269813
The semantics of NOT IN
require that nothing be returned if any value in the subquery is NULL
. Hence, Postgres needs to look at all the values.
I strongly recommend never using NOT IN
with a subquery. Always use NOT EXISTS
:
DELETE FROM AuditTaskImpl l
WHERE NOT EXISTS (SELECT 1
FROM ProcessInstanceLog spl
WHERE l.processInstanceId = spl.spl.processInstanceId AND
spl.status not in (2, 3)
);
Upvotes: 4