gene b.
gene b.

Reputation: 11994

Slow Postgres Query with a SELECT DISTINCT and a Join

We have a Postgres DB which was migrated from Oracle. There is one query which is slow in Posgres (8-11 sec) but was fast in Oracle (1-2 sec). It involves a SELECT DISTINCT, which several other threads have flagged as a problem in Postgres: 1, 2. Is there something simple we can do, like add an index or something, without a major rewrite?

The query joins the table NED_PERSON_T with itself to get all rows whose UNIQUEIDENTIFIER is the parent value specified in the child's NIHSERVAO. It is a SELECT DISTINCT as below.

select
    distinct nedperson0_.ID as ID1_21_,
    nedperson0_.BUILDINGNAME as BUILDING2_21_,
    nedperson0_.CREATED_DATE as CREATED_4_21_,
    nedperson0_.CURRENT_FLAG as CURRENT_5_21_,
    /* ... + 30 other columns ... */
from
    NED_PERSON_T nedperson0_ 
inner join
    NED_PERSON_T nedperson1_ 
        on (
            nedperson1_.NIHSERVAO=nedperson0_.UNIQUEIDENTIFIER 
            and nedperson0_.CURRENT_FLAG='Y' 
            and nedperson1_.CURRENT_FLAG='Y'
        ) 
order by
    nedperson0_.NIHMIXCASESN,
    nedperson0_.MIXCASE_GIVENNAME

The EXPLAIN ANALYZE gives this report:

Unique  (cost=362155.58..390755.66 rows=119167 width=1197) (actual time=8722.383..11202.495 rows=838 loops=1)
        Sort Key: nedperson0_.nihmixcasesn, nedperson0_.mixcase_givenname, nedperson0_.id, nedperson0_.buildingname, nedperson0_.created_date, nedperson0_.description, ... ... )
        Sort Method: external merge  Disk: 78016kB
        ->  Gather  (cost=103675.72..227466.24 rows=119167 width=1197) (actual time=2485.391..3100.424 rows=97678 loops=1)
              Workers Planned: 2
              Workers Launched: 2
              ->  Parallel Hash Join  (cost=102675.72..214549.54 rows=49653 width=1197) (actual time=2477.318..2880.555 rows=32559 loops=3)
                    Hash Cond: (nedperson1_.nihservao = nedperson0_.uniqueidentifier)
                    ->  Parallel Index Scan using ned_person_t_current_flag_idx on ned_person_t nedperson1_  (cost=0.43..93430.46 rows=58867 width=11) (actual time=7.536..815.779 rows=46800 loops=3)
                          Index Cond: (current_flag = 'Y'::bpchar)
                    ->  Parallel Hash  (cost=93430.46..93430.46 rows=58867 width=1154) (actual time=1620.284..1620.285 rows=46800 loops=3)
                          Buckets: 4096  Batches: 64  Memory Usage: 1792kB
                          ->  Parallel Index Scan using ned_person_t_current_flag_idx on ned_person_t nedperson0_  (cost=0.43..93430.46 rows=58867 width=1154) (actual time=4.602..1226.402 rows=46800 loops=3)
                                Index Cond: (current_flag = 'Y'::bpchar)
Planning Time: 3.986 ms
Execution Time: 11218.403 ms

Indexes :

CREATE INDEX ned_person_t_ao_test1 ON ned.ned_person_t USING btree (nihmixcasesn, mixcase_givenname, uniqueidentifier)
CREATE INDEX ned_person_t_current_flag_idx ON ned.ned_person_t USING btree (current_flag)
CREATE INDEX ned_person_t_id_idx ON ned.ned_person_t USING btree (id)
CREATE INDEX ned_person_t_mdslink_idx ON ned.ned_person_t USING btree (nihmdslinktoadnih)
CREATE INDEX ned_person_t_nedid_idx ON ned.ned_person_t USING btree (uniqueidentifier)
CREATE INDEX ned_person_t_orgacronym_idx ON ned.ned_person_t USING btree (nihorgacronym)
CREATE INDEX ned_person_t_orgstat_idx ON ned.ned_person_t USING btree (organizationalstat)
CREATE UNIQUE INDEX ned_person_t_pk ON ned.ned_person_t USING btree (id)
CREATE INDEX ned_person_t_sac_idx ON ned.ned_person_t USING btree (nihsac)
CREATE INDEX ned_person_t_temp_idx ON ned.ned_person_t USING btree (nihservao, organizationalstat, nihsac)
CREATE INDEX ned_person_t_temp_idx2 ON ned.ned_person_t USING btree (lower(nihmdslinktoadnih), current_flag, nihservao, organizationalstat, nihsac)

STATISTICS ON NED_PERSON_T :

select count(*) from ned_person_t

Total Count = 1243733

select count(*) from ned_person_t where current_flag = 'Y';

with CURRENT_FLAG = Y: 142540

select count(*) from ned_person_t where nihservao is not null;

with non-NULL NIHSERVAO: 644852

select count(distinct nihservao) from ned_person_t;

Distinct NIHSERVAO: 928

Upvotes: 1

Views: 1053

Answers (3)

Erwin Brandstetter
Erwin Brandstetter

Reputation: 657002

Your EXISTS query looks good.

Only 142540 of 1243733 rows have CURRENT_FLAG='Y'.
And only 644852 have nihservao IS NOT NULL. (Combination with the 1st line matters.)
No cases of uniqueidentifier IS NULL.

If you query on that condition regularly, a partial index should be very efficient. I suggest these two:

CREATE INDEX ON ned.ned_person_t (nihservao)
WHERE current_flag = 'Y' AND nihservao IS NOT NULL;

Since you only need to return uniqueidentifier and fullname (like you later commented) - and if fullname isn't too big - a covering index can help some more. (Requires the table to be vacuumed enough.)

CREATE INDEX ON ned.ned_person_t (uniqueidentifier) INCLUDE (fullname)
WHERE current_flag = 'Y';

Example:

While being at it check whether all your many indexes are actually (still) used. This one is completely redundant, remove it:

CREATE INDEX ned_person_t_id_idx ON ned.ned_person_t USING btree (id)

id is already covered by the unique index ned_person_t_pk (which could really be a PRIMARY KEY instead of just a unique index).

The final sort operation is of lesser importance as only 844 rows remain.

Upvotes: 1

gene b.
gene b.

Reputation: 11994

As more information, per the advice above, I (1) rewrote the query with EXISTS, (2) increases MEM_SIZE to 100 MB. But this still takes 7 sec. Below is the EXPLAIN. If I re-run this query sequentially afterwards it takes much less time due to caching, but the first run takes about 7 sec., as you see below.

UPDATE: The Exists re-write did cut the time by half, to about 4 sec. This run wasn't really representative, we tried a few more (after disconnecting/reconnecting) and the EXISTS was better on average by about half.

QUERY:

select
    nedperson0_.ID as ID1_21_,
    nedperson0_.BUILDINGNAME as BUILDING2_21_,
    nedperson0_.CREATED_DATE as CREATED_4_21_,
    nedperson0_.CURRENT_FLAG as CURRENT_5_21_,
    /* ... + 30 other columns ... */
from
    NED_PERSON_T nedperson0_ 
    where
        exists (select uniqueidentifier from ned_person_t nedperson1_ 
                where nihservao = nedperson0_.uniqueidentifier and nedperson1_.current_flag = 'Y'
                )
        and 
        nedperson0_.current_flag = 'Y'
    order by
        nedperson0_.NIHMIXCASESN,
        nedperson0_.MIXCASE_GIVENNAME

EXPLAIN ANALYZE:

Sort  (cost=139290.39..139293.75 rows=1344 width=1195) (actual time=7627.351..7627.441 rows=844 loops=1)
  Sort Method: quicksort  Memory: 882kB
  ->  Nested Loop  (cost=97523.64..139220.55 rows=1344 width=1195) (actual time=5112.442..7621.895 rows=844 loops=1)
        ->  HashAggregate  (cost=97523.21..97531.04 rows=783 width=11) (actual time=5108.419..5109.041 rows=847 loops=1)
              Group Key: nedperson1_.nihservao
              ->  Index Scan using ned_person_t_current_flag_idx on ned_person_t nedperson1_  (cost=0.43..97172.64 rows=140228 width=11) (actual time=2.376..4999.905 rows=142540 loops=1)
                    Index Cond: (current_flag = 'Y'::bpchar)
        ->  Index Scan using ned_person_t_nedid_idx on ned_person_t nedperson0_  (cost=0.43..53.21 rows=2 width=1152) (actual time=2.964..2.964 rows=1 loops=847)
              Index Cond: (uniqueidentifier = nedperson1_.nihservao)
              Filter: (current_flag = 'Y'::bpchar)
              Rows Removed by Filter: 23
Planning Time: 10.259 ms
Execution Time: 7627.670 ms

Upvotes: 0

jjanes
jjanes

Reputation: 44227

What you need to make your EXISTS query run fast is an index on (current_flag, nihservao) (or maybe partial index on nihservao filtered on current_flag) so that it can use an index-only scan to get the nihservao values without needing to jump to random parts of the table for each row. It would require keeping the table well-vacuumed for this to be effective.

If many of your other queries also focus only on current data, then it might make sense to partition your data by current_flag, to keep the relevant data more densely packed. That is more drastic than just creating an index, though.

Upvotes: 1

Related Questions