Robs
Robs

Reputation: 153

postgres two column sort low performance

I've got a query that performs multiple joins. I try to get only those positions of each keyword that are latest in results.

Here is the query:

SELECT DISTINCT ON (p.keyword_id)
  a.id        AS account_id,
  w.parent_id AS parent_id,
  w.name      AS name,
  p.position  AS position
FROM websites w
  JOIN accounts a ON w.account_id = a.id
  JOIN keywords k ON k.website_id = w.parent_id
  JOIN positions p ON p.website_id = w.parent_id
WHERE a.amount > 0 AND w.parent_id NOTNULL AND (round((a.amount / a.payment_renewal_period), 2) BETWEEN 1 AND 19)
ORDER BY p.keyword_id, p.created_at DESC;

Plan with costs for that query is as follows:

   Unique  (cost=73673.65..76630.38 rows=264 width=40) (actual time=30777.117..49143.023 rows=259 loops=1)
       ->  Sort  (cost=73673.65..75152.02 rows=591347 width=40) (actual time=30777.116..47352.373 rows=10891486 loops=1)
             Sort Key: p.keyword_id, p.created_at DESC
             Sort Method: external merge  Disk: 512672kB
             ->  Merge Join  (cost=219.59..812.26 rows=591347 width=40) (actual time=3.487..3827.028 rows=10891486 loops=1)
                   Merge Cond: (w.parent_id = k.website_id)
                   ->  Nested Loop  (cost=128.46..597.73 rows=1268 width=44) (actual time=3.378..108.915 rows=61582 loops=1)
                         ->  Nested Loop  (cost=2.28..39.86 rows=1 width=28) (actual time=0.026..0.216 rows=7 loops=1)
                               ->  Index Scan using index_websites_on_parent_id on websites w  (cost=0.14..15.08 rows=4 width=28) (actual time=0.004..0.023 rows=7 loops=1)
                                     Index Cond: (parent_id IS NOT NULL)
                               ->  Bitmap Heap Scan on accounts a  (cost=2.15..6.18 rows=1 width=4) (actual time=0.019..0.020 rows=1 loops=7)
                                     Recheck Cond: (id = w.account_id)
                                     Filter: ((amount > '0'::numeric) AND (round((amount / (payment_renewal_period)::numeric), 2) >= '1'::numeric) AND (round((amount / (payment_renewal_period)::numeric), 2) <= '19'::numeric))
                                     Heap Blocks: exact=7
                                     ->  Bitmap Index Scan on accounts_pkey  (cost=0.00..2.15 rows=1 width=0) (actual time=0.006..0.006 rows=1 loops=7)
                                           Index Cond: (id = w.account_id)
                         ->  Bitmap Heap Scan on positions p  (cost=126.18..511.57 rows=4631 width=16) (actual time=0.994..8.226 rows=8797 loops=7)
                               Recheck Cond: (website_id = w.parent_id)
                               Heap Blocks: exact=1004
                               ->  Bitmap Index Scan on index_positions_on_5_columns  (cost=0.00..125.02 rows=4631 width=0) (actual time=0.965..0.965 rows=8797 loops=7)
                                     Index Cond: (website_id = w.parent_id)
                   ->  Sort  (cost=18.26..18.92 rows=264 width=4) (actual time=0.106..1013.966 rows=10891487 loops=1)
                         Sort Key: k.website_id
                         Sort Method: quicksort  Memory: 37kB
                         ->  Seq Scan on keywords k  (cost=0.00..7.64 rows=264 width=4) (actual time=0.005..0.039 rows=263 loops=1)
     Planning time: 1.081 ms
     Execution time: 49184.222 ms

The thing is when I run query with w.id instead of w.parent_id in join positions part total cost decreases to

Unique  (cost=3621.07..3804.99 rows=264 width=40) (actual time=128.430..139.550 rows=259 loops=1)
   ->  Sort  (cost=3621.07..3713.03 rows=36784 width=40) (actual time=128.429..135.444 rows=40385 loops=1)
         Sort Key: p.keyword_id, p.created_at DESC
         Sort Method: external sort  Disk: 2000kB
         ->  Merge Join  (cost=128.73..831.59 rows=36784 width=40) (actual time=25.521..63.299 rows=40385 loops=1)
               Merge Cond: (k.website_id = w.id)
               ->  Index Only Scan using index_keywords_on_website_id_deleted_at on keywords k  (cost=0.27..24.23 rows=264 width=4) (actual time=0.137..0.274 rows=263 loops=1)
                     Heap Fetches: 156
               ->  Materialize  (cost=128.46..606.85 rows=1268 width=44) (actual time=3.772..49.587 rows=72242 loops=1)
                     ->  Nested Loop  (cost=128.46..603.68 rows=1268 width=44) (actual time=3.769..30.530 rows=61582 loops=1)
                           ->  Nested Loop  (cost=2.28..45.80 rows=1 width=32) (actual time=0.047..0.204 rows=7 loops=1)
                                 ->  Index Scan using websites_pkey on websites w  (cost=0.14..21.03 rows=4 width=32) (actual time=0.007..0.026 rows=7 loops=1)
                                       Filter: (parent_id IS NOT NULL)
                                       Rows Removed by Filter: 4
                                 ->  Bitmap Heap Scan on accounts a  (cost=2.15..6.18 rows=1 width=4) (actual time=0.018..0.019 rows=1 loops=7)
                                       Recheck Cond: (id = w.account_id)
                                       Filter: ((amount > '0'::numeric) AND (round((amount / (payment_renewal_period)::numeric), 2) >= '1'::numeric) AND (round((amount / (payment_renewal_period)::numeric), 2) <= '19'::numeric))
                                       Heap Blocks: exact=7
                                       ->  Bitmap Index Scan on accounts_pkey  (cost=0.00..2.15 rows=1 width=0) (actual time=0.004..0.004 rows=1 loops=7)
                                             Index Cond: (id = w.account_id)
                           ->  Bitmap Heap Scan on positions p  (cost=126.18..511.57 rows=4631 width=16) (actual time=0.930..2.341 rows=8797 loops=7)
                                 Recheck Cond: (website_id = w.parent_id)
                                 Heap Blocks: exact=1004
                                 ->  Bitmap Index Scan on index_positions_on_5_columns  (cost=0.00..125.02 rows=4631 width=0) (actual time=0.906..0.906 rows=8797 loops=7)
                                       Index Cond: (website_id = w.parent_id)
 Planning time: 1.124 ms
 Execution time: 157.167 ms

Indexes on websites

Indexes:
    "websites_pkey" PRIMARY KEY, btree (id)
    "index_websites_on_account_id" btree (account_id)
    "index_websites_on_deleted_at" btree (deleted_at)
    "index_websites_on_domain_id" btree (domain_id)
    "index_websites_on_parent_id" btree (parent_id)
    "index_websites_on_processed_at" btree (processed_at)

Indexes on positions

Indexes:
    "positions_pkey" PRIMARY KEY, btree (id)
    "index_positions_on_5_columns" UNIQUE, btree (website_id, keyword_id, created_at, engine_id, region_id)
    "overlap_index" btree (keyword_id, created_at)

Upvotes: 0

Views: 295

Answers (1)

Laurenz Albe
Laurenz Albe

Reputation: 246848

The second EXPLAIN output shows more than 200 times fewer rows, so it is hardly surprising that sorting is much faster.

You will notice that the sort spills to disk in both cases (Sort Method: external merge Disk: ...kB). If you can keep the sort in memory by raising work_mem, it will be much faster.
But the first sort is so large that you won't be able to fit it in memory.

Ideas to speed up the query:

  • An index on (keyword_id, created_at)for positions. Not sure if that helps though.

  • Do the filtering first, like this:

    SELECT
       a.id        AS account_id,
       w.parent_id AS parent_id,
       w.name      AS name,
       p.position  AS position
    FROM (SELECT DISTINCT ON (keyword_id)
             positions,
             website_id,
             keyword_id,
             created_at
          FROM positions
          ORDER BY keyword_id, created_at DESC) p
       JOIN ...
    WHERE ...
    ORDER BY p.keyword_id, p.created_at DESC;
    

Remark: The DISTINCT ON is somewhat strange, since you do not ORDER BY the values of the SELECT list, so the result values are not well defined.

Upvotes: 1

Related Questions