How to optimize a PostrgreSQL query and avoid using "NOT IN"?

Question

A newbie question for PostgreSQL. I have the following query in which I am trying to return a union of all records that fit the criteria of:

all dx marked chronic prior to tencounter,
all recurrent dx (i.e., a dx record that is marked treated more recent then any previous resolution, and
any dx record marked treated that has never been resolved

Is there a better way of doing this (perhaps using the WITH clause of PostgreSQL)? I have read not to us "NOT IN" in PostgreSQL, so how better could this be done? How do you "optimize" this thing?

   CREATE OR REPLACE FUNCTION f_getactivedx(groupid character varying, tencounter timestamp without time zone)
   RETURNS SETOF view_dx AS
$BODY$

    select max(dx.recid) as recid, dx.cicd9, dx.cdesc, max( dx.tposted) as tposted, 
    bool_or(dx.resolved) as resolved, bool_or(dx.treated) as treated, bool_or(dx.chronic),
    dx.groupid 
        from dx 
    where dx.chronic = true 
    and dx.groupid = $1 
    and date_trunc('day',dx.tposted) <= date_trunc('day',$2)
    group by dx.cicd9, dx.cdesc, dx.groupid

    union


    select max(dx.recid) as recid, dx.cicd9, dx.cdesc, max( dx.tposted) as tposted, 
    bool_and(dx.resolved), bool_and(dx.treated), bool_and(dx.chronic), dx.groupid 
    from dx 
            join    (select cdesc, max(tposted) as tposted from dx  
                       where groupid =$1  and resolved = true and 
                       date_trunc('day',tposted) <=  date_trunc('day', $2)
                       group by cdesc) j 
              on (dx.cdesc = j.cdesc and dx.tposted > j.tposted) 
              where groupid = $1 and treated = true
              and date_trunc('day',dx.tposted)  <= date_trunc('day', $2)
              group by dx.cicd9, dx.cdesc, dx.groupid 

    union 

    select max(dx.recid) as recid, dx.cicd9, dx.cdesc, max( dx.tposted), 
       bool_and(dx.resolved), bool_and(dx.treated), bool_and(dx.chronic), dx.groupid 
              from dx 
              where dx.cdesc NOT IN
                (select cdesc from dx  
                       where groupid =$1  and resolved = true and 
                       date_trunc('day',tposted) <=  date_trunc('day', $2)
                       group by cdesc) 
             and groupid =$1 and treated = true and
              date_trunc('day',tposted)   <= date_trunc('day', $2)        
              group by dx.cicd9, dx.cdesc, dx.groupid

    order by tposted desc, treated desc, resolved desc, cdesc asc

Craig Ringer · Accepted Answer

NOT IN can be OK, you just have to think carefully about NULL, and an anti-join is often a better choice.

For any query:

SELECT ...
FROM t
WHERE col NOT IN (SELECT col2 FROM t2 WHERE col2 IS NOT NULL AND ...predicate...)

you can equivalently write:

SELECT ...
FROM t LEFT OUTER JOIN t2 ON (t.col = t2.col2 AND ...predicate...)
WHERE t2.col2 IS NULL;

which is called a "left anti-join".

PostgreSQL may well produce the same query plan for both.

Unless you know that your NOT IN use is causing performance issues for some reason, after proper checking with explain analyze, I strongly suggest just leaving it alone.

How to optimize a PostrgreSQL query and avoid using "NOT IN"?

Answers (2)

Related Questions

How to optimize a PostrgreSQL query and avoid using &quot;NOT IN&quot;?

Answers (2)

Related Questions

How to optimize a PostrgreSQL query and avoid using "NOT IN"?