Reputation: 11994
We have a Postgres DB which was migrated from Oracle. There is one query which is slow in Posgres (8-11 sec) but was fast in Oracle (1-2 sec). It involves a SELECT DISTINCT, which several other threads have flagged as a problem in Postgres: 1, 2. Is there something simple we can do, like add an index or something, without a major rewrite?
The query joins the table NED_PERSON_T
with itself to get all rows whose UNIQUEIDENTIFIER
is the parent value specified in the child's NIHSERVAO
. It is a SELECT DISTINCT
as below.
select
distinct nedperson0_.ID as ID1_21_,
nedperson0_.BUILDINGNAME as BUILDING2_21_,
nedperson0_.CREATED_DATE as CREATED_4_21_,
nedperson0_.CURRENT_FLAG as CURRENT_5_21_,
/* ... + 30 other columns ... */
from
NED_PERSON_T nedperson0_
inner join
NED_PERSON_T nedperson1_
on (
nedperson1_.NIHSERVAO=nedperson0_.UNIQUEIDENTIFIER
and nedperson0_.CURRENT_FLAG='Y'
and nedperson1_.CURRENT_FLAG='Y'
)
order by
nedperson0_.NIHMIXCASESN,
nedperson0_.MIXCASE_GIVENNAME
The EXPLAIN ANALYZE gives this report:
Unique (cost=362155.58..390755.66 rows=119167 width=1197) (actual time=8722.383..11202.495 rows=838 loops=1)
Sort Key: nedperson0_.nihmixcasesn, nedperson0_.mixcase_givenname, nedperson0_.id, nedperson0_.buildingname, nedperson0_.created_date, nedperson0_.description, ... ... )
Sort Method: external merge Disk: 78016kB
-> Gather (cost=103675.72..227466.24 rows=119167 width=1197) (actual time=2485.391..3100.424 rows=97678 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Hash Join (cost=102675.72..214549.54 rows=49653 width=1197) (actual time=2477.318..2880.555 rows=32559 loops=3)
Hash Cond: (nedperson1_.nihservao = nedperson0_.uniqueidentifier)
-> Parallel Index Scan using ned_person_t_current_flag_idx on ned_person_t nedperson1_ (cost=0.43..93430.46 rows=58867 width=11) (actual time=7.536..815.779 rows=46800 loops=3)
Index Cond: (current_flag = 'Y'::bpchar)
-> Parallel Hash (cost=93430.46..93430.46 rows=58867 width=1154) (actual time=1620.284..1620.285 rows=46800 loops=3)
Buckets: 4096 Batches: 64 Memory Usage: 1792kB
-> Parallel Index Scan using ned_person_t_current_flag_idx on ned_person_t nedperson0_ (cost=0.43..93430.46 rows=58867 width=1154) (actual time=4.602..1226.402 rows=46800 loops=3)
Index Cond: (current_flag = 'Y'::bpchar)
Planning Time: 3.986 ms
Execution Time: 11218.403 ms
Indexes :
CREATE INDEX ned_person_t_ao_test1 ON ned.ned_person_t USING btree (nihmixcasesn, mixcase_givenname, uniqueidentifier)
CREATE INDEX ned_person_t_current_flag_idx ON ned.ned_person_t USING btree (current_flag)
CREATE INDEX ned_person_t_id_idx ON ned.ned_person_t USING btree (id)
CREATE INDEX ned_person_t_mdslink_idx ON ned.ned_person_t USING btree (nihmdslinktoadnih)
CREATE INDEX ned_person_t_nedid_idx ON ned.ned_person_t USING btree (uniqueidentifier)
CREATE INDEX ned_person_t_orgacronym_idx ON ned.ned_person_t USING btree (nihorgacronym)
CREATE INDEX ned_person_t_orgstat_idx ON ned.ned_person_t USING btree (organizationalstat)
CREATE UNIQUE INDEX ned_person_t_pk ON ned.ned_person_t USING btree (id)
CREATE INDEX ned_person_t_sac_idx ON ned.ned_person_t USING btree (nihsac)
CREATE INDEX ned_person_t_temp_idx ON ned.ned_person_t USING btree (nihservao, organizationalstat, nihsac)
CREATE INDEX ned_person_t_temp_idx2 ON ned.ned_person_t USING btree (lower(nihmdslinktoadnih), current_flag, nihservao, organizationalstat, nihsac)
STATISTICS ON NED_PERSON_T :
select count(*) from ned_person_t
Total Count = 1243733
select count(*) from ned_person_t where current_flag = 'Y';
with CURRENT_FLAG = Y: 142540
select count(*) from ned_person_t where nihservao is not null;
with non-NULL NIHSERVAO: 644852
select count(distinct nihservao) from ned_person_t;
Distinct NIHSERVAO: 928
Upvotes: 1
Views: 1053
Reputation: 657002
Your EXISTS
query looks good.
Only 142540 of 1243733 rows have CURRENT_FLAG='Y'
.
And only 644852 have nihservao IS NOT NULL
. (Combination with the 1st line matters.)
No cases of uniqueidentifier IS NULL
.
If you query on that condition regularly, a partial index should be very efficient. I suggest these two:
CREATE INDEX ON ned.ned_person_t (nihservao)
WHERE current_flag = 'Y' AND nihservao IS NOT NULL;
Since you only need to return uniqueidentifier
and fullname
(like you later commented) - and if fullname
isn't too big - a covering index can help some more. (Requires the table to be vacuumed enough.)
CREATE INDEX ON ned.ned_person_t (uniqueidentifier) INCLUDE (fullname)
WHERE current_flag = 'Y';
Example:
While being at it check whether all your many indexes are actually (still) used. This one is completely redundant, remove it:
CREATE INDEX ned_person_t_id_idx ON ned.ned_person_t USING btree (id)
id
is already covered by the unique index ned_person_t_pk
(which could really be a PRIMARY KEY
instead of just a unique index).
The final sort operation is of lesser importance as only 844
rows remain.
Upvotes: 1
Reputation: 11994
As more information, per the advice above, I (1) rewrote the query with EXISTS, (2) increases MEM_SIZE to 100 MB. But this still takes 7 sec. Below is the EXPLAIN. If I re-run this query sequentially afterwards it takes much less time due to caching, but the first run takes about 7 sec., as you see below.
UPDATE: The Exists re-write did cut the time by half, to about 4 sec. This run wasn't really representative, we tried a few more (after disconnecting/reconnecting) and the EXISTS was better on average by about half.
QUERY:
select
nedperson0_.ID as ID1_21_,
nedperson0_.BUILDINGNAME as BUILDING2_21_,
nedperson0_.CREATED_DATE as CREATED_4_21_,
nedperson0_.CURRENT_FLAG as CURRENT_5_21_,
/* ... + 30 other columns ... */
from
NED_PERSON_T nedperson0_
where
exists (select uniqueidentifier from ned_person_t nedperson1_
where nihservao = nedperson0_.uniqueidentifier and nedperson1_.current_flag = 'Y'
)
and
nedperson0_.current_flag = 'Y'
order by
nedperson0_.NIHMIXCASESN,
nedperson0_.MIXCASE_GIVENNAME
EXPLAIN ANALYZE:
Sort (cost=139290.39..139293.75 rows=1344 width=1195) (actual time=7627.351..7627.441 rows=844 loops=1)
Sort Method: quicksort Memory: 882kB
-> Nested Loop (cost=97523.64..139220.55 rows=1344 width=1195) (actual time=5112.442..7621.895 rows=844 loops=1)
-> HashAggregate (cost=97523.21..97531.04 rows=783 width=11) (actual time=5108.419..5109.041 rows=847 loops=1)
Group Key: nedperson1_.nihservao
-> Index Scan using ned_person_t_current_flag_idx on ned_person_t nedperson1_ (cost=0.43..97172.64 rows=140228 width=11) (actual time=2.376..4999.905 rows=142540 loops=1)
Index Cond: (current_flag = 'Y'::bpchar)
-> Index Scan using ned_person_t_nedid_idx on ned_person_t nedperson0_ (cost=0.43..53.21 rows=2 width=1152) (actual time=2.964..2.964 rows=1 loops=847)
Index Cond: (uniqueidentifier = nedperson1_.nihservao)
Filter: (current_flag = 'Y'::bpchar)
Rows Removed by Filter: 23
Planning Time: 10.259 ms
Execution Time: 7627.670 ms
Upvotes: 0
Reputation: 44227
What you need to make your EXISTS query run fast is an index on (current_flag, nihservao)
(or maybe partial index on nihservao filtered on current_flag) so that it can use an index-only scan to get the nihservao values without needing to jump to random parts of the table for each row. It would require keeping the table well-vacuumed for this to be effective.
If many of your other queries also focus only on current data, then it might make sense to partition your data by current_flag, to keep the relevant data more densely packed. That is more drastic than just creating an index, though.
Upvotes: 1