Reputation: 9879
What is the right way to set index for next query?
SELECT t1.purchaseNumber, t1.parsing_status, t1.docPublishDate
FROM xml_files t1
LEFT JOIN xml_files t2
ON t1.purchaseNumber = t2.purchaseNumber
AND t1.docPublishDate < t2.docPublishDate
WHERE t1.parsing_status IS NULL
AND t2.parsing_status IS NULL
AND t2.docPublishDate IS NULL
AND t1.section_name='contracts' AND t1.parsing_status IS NULL AND t1.random IN (1,2,3,4)
Should I create composite index or better to create single index for every table that used in query?
Also if I am doing comparison of timestamp docPublishDate
how should I create in index? Should I use desc
keyword?
purchaseNumber
- varchar(50)
parsing_status
- varchar(10)
random
- integer
section_name
- varchar(10)
EXPLAIN (ANALYZE, BUFFERS) query;
:
Gather (cost=1000.86..137158.61 rows=43091 width=35) (actual time=22366.063..72674.678 rows=46518 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=99244069 read=144071
-> Nested Loop Anti Join (cost=0.86..131849.51 rows=17955 width=35) (actual time=22309.989..72440.514 rows=15506 loops=3)
Buffers: shared hit=99244069 read=144071
-> Parallel Index Scan using index_for_xml_files_parsing_status on xml_files t1 (cost=0.43..42606.31 rows=26932 width=35) (actual time=0.086..193.982 rows=40725 loops=3)
Index Cond: ((parsing_status IS NULL) AND (parsing_status IS NULL))
Filter: (((section_name)::text = 'contracts'::text) AND (random = ANY ('{1,2,3,4}'::integer[])))
Rows Removed by Filter: 383974
Buffers: shared hit=15724 read=42304
-> Index Scan using "index_for_xml_files_purchaseNumber" on xml_files t2 (cost=0.43..4.72 rows=3 width=27) (actual time=1.773..1.773 rows=1 loops=122174)
Index Cond: (("purchaseNumber")::text = (t1."purchaseNumber")::text)
Filter: (t1."docPublishDate" < "docPublishDate")
Rows Removed by Filter: 6499
Buffers: shared hit=99228345 read=101767
Planning Time: 0.396 ms
Execution Time: 72681.868 ms
Data example: How to improve speed of query?
Upvotes: 0
Views: 61
Reputation: 1271241
You should explain what you want the query to do. I would write the query more clearly as:
SELECT t1.purchaseNumber, t1.parsing_status, t1.docPublishDate
FROM xml_files t1
WHERE t1.section_name = 'contracts' AND
t1.parsing_status IS NULL AND
t1.random IN (1, 2, 3, 4) AND
NOT EXISTS (SELECT 1
FROM xml_files t2
WHERE t1.purchaseNumber = t2.purchaseNumber AND
t1.docPublishDate < t2.docPublishDate
);
For this query, I would suggest the the following indexes:
create index idx_xml_files_3 on xml_files(section_name, random)
where parsing_status is null;
create index idx_xml_files_2 on xml_files(purchaseNumber, docPublishDate);
There is probably an even better way to write the query, using window functions for instance. However, it is not clear what your data looks like nor what the query is intended to do.
Upvotes: 1
Reputation: 248305
The index scan on the inner side of the nested loop join is inefficient: on average, 6499 of the 6500 rows found are discarded.
Create a better index:
CREATE INDEX ON xml_files ("purchaseNumber", "docPublishDate");
Upvotes: 0