Dmitry Bubnenkov

Reputation: 9879

Should I use a single or a composite index?

What is the right way to set up indexes for the following query?

SELECT t1.purchaseNumber, t1.parsing_status, t1.docPublishDate
  FROM xml_files t1
  LEFT JOIN xml_files t2
    ON t1.purchaseNumber = t2.purchaseNumber
    AND t1.docPublishDate < t2.docPublishDate
  WHERE t1.parsing_status IS NULL
  AND t2.parsing_status IS NULL
  AND t2.docPublishDate IS NULL
  AND t1.section_name = 'contracts'
  AND t1.parsing_status IS NULL
  AND t1.random IN (1, 2, 3, 4)

Should I create a composite index, or is it better to create a single index for each column used in the query? Also, since I am comparing the timestamp docPublishDate, how should I build that index? Should I use the DESC keyword?

purchaseNumber - varchar(50)

parsing_status - varchar(10)

random - integer

section_name - varchar(10)

Output of EXPLAIN (ANALYZE, BUFFERS):

Gather  (cost=1000.86..137158.61 rows=43091 width=35) (actual time=22366.063..72674.678 rows=46518 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  Buffers: shared hit=99244069 read=144071
  ->  Nested Loop Anti Join  (cost=0.86..131849.51 rows=17955 width=35) (actual time=22309.989..72440.514 rows=15506 loops=3)
        Buffers: shared hit=99244069 read=144071
        ->  Parallel Index Scan using index_for_xml_files_parsing_status on xml_files t1  (cost=0.43..42606.31 rows=26932 width=35) (actual time=0.086..193.982 rows=40725 loops=3)
              Index Cond: ((parsing_status IS NULL) AND (parsing_status IS NULL))
              Filter: (((section_name)::text = 'contracts'::text) AND (random = ANY ('{1,2,3,4}'::integer[])))
              Rows Removed by Filter: 383974
              Buffers: shared hit=15724 read=42304
        ->  Index Scan using "index_for_xml_files_purchaseNumber" on xml_files t2  (cost=0.43..4.72 rows=3 width=27) (actual time=1.773..1.773 rows=1 loops=122174)
              Index Cond: (("purchaseNumber")::text = (t1."purchaseNumber")::text)
              Filter: (t1."docPublishDate" < "docPublishDate")
              Rows Removed by Filter: 6499
              Buffers: shared hit=99228345 read=101767
Planning Time: 0.396 ms
Execution Time: 72681.868 ms

Data example:

How can I improve the speed of this query?

Upvotes: 0

Views: 61

Answers (2)

Gordon Linoff

Reputation: 1271241

You should explain what you want the query to do. I would write the query more clearly as:

SELECT t1.purchaseNumber, t1.parsing_status, t1.docPublishDate
FROM xml_files t1
WHERE t1.section_name = 'contracts' AND
      t1.parsing_status IS NULL AND
      t1.random IN (1, 2, 3, 4) AND
      NOT EXISTS (SELECT 1
                  FROM xml_files t2
                  WHERE t1.purchaseNumber = t2.purchaseNumber AND
                        t1.docPublishDate < t2.docPublishDate
                 );

For this query, I would suggest the following indexes:

create index idx_xml_files_3 on xml_files(section_name, random)
    where parsing_status is null;

create index idx_xml_files_2 on xml_files(purchaseNumber, docPublishDate);

There is probably an even better way to write the query, using window functions for instance. However, it is not clear what your data looks like nor what the query is intended to do.
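For instance, a minimal sketch of that window-function variant might look like the following (this assumes the goal is to keep, per purchaseNumber, the row with the latest docPublishDate; note it only compares rows within the filtered set, whereas the NOT EXISTS version compares against the whole table):

-- Sketch only: assumes "latest row per purchaseNumber" is the intent.
-- Ties on docPublishDate would need RANK() instead of ROW_NUMBER().
SELECT purchaseNumber, parsing_status, docPublishDate
FROM (SELECT t.*,
             ROW_NUMBER() OVER (PARTITION BY purchaseNumber
                                ORDER BY docPublishDate DESC) AS rn
      FROM xml_files t
      WHERE section_name = 'contracts' AND
            parsing_status IS NULL AND
            random IN (1, 2, 3, 4)
     ) t
WHERE rn = 1;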

Upvotes: 1

Laurenz Albe

Reputation: 248305

The index scan on the inner side of the nested loop join is inefficient: on average, 6499 of the 6500 rows found are discarded.

Create a better index:

CREATE INDEX ON xml_files ("purchaseNumber", "docPublishDate");
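With both columns in the index, the docPublishDate comparison should be able to move into the Index Cond of the inner scan instead of being applied as a Filter, so each probe reads only the matching rows. As a rough check (a sketch using the NOT EXISTS form from the other answer; identifiers are quoted here only because the plan shows mixed-case column names), re-run the plan and verify that both columns appear in the Index Cond and that "Rows Removed by Filter" drops:

-- Sketch: confirm the anti-join probe now uses both index columns.
EXPLAIN (ANALYZE, BUFFERS)
SELECT t1."purchaseNumber", t1.parsing_status, t1."docPublishDate"
FROM xml_files t1
WHERE t1.section_name = 'contracts'
  AND t1.parsing_status IS NULL
  AND t1.random IN (1, 2, 3, 4)
  AND NOT EXISTS (SELECT 1
                  FROM xml_files t2
                  WHERE t2."purchaseNumber" = t1."purchaseNumber"
                    AND t2."docPublishDate" > t1."docPublishDate");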

Upvotes: 0
