How do I make postgres avoid doing a double sequential scan for this seek pagination query?

Question

The Schema

I have a bunch of posts stored in a table (feed_items)
I have a table containing which user id liked/disliked which feed_item_id (feed_item_likes_dislikes)
I have another table containing which user id loved/angered which feed_item_id (feed_item_love_anger)
I have a fourth table containing which feed_item_id has what tags where tags is an ARRAY of varchar (feed_item_tags)
The total number of likes/dislikes per post is stored in a materialized view (feed_item_likes_dislikes_aggregate)
The total number of love/anger is stored in another materialized view (feed_item_love_anger_agregate)
Likes dislikes love and anger are stored separately because a post can be liked/disliked and loved/angered at the same time (business requirements unfortunately)
I have 2 columns called title_vector and summary_vector of type TSVECTOR in feed_items which is helpful to find posts by search keyword (full text search in postgres)

The Problem

I want to find all the posts in DESCENDING order of their pubdate and feed_item_id
Some posts are published at the same time and I want to paginate using (pubdate, feed_item_id) < (value1, value2) seek pagination method described HERE

My Page 1 Query

Find posts with likes > 0 having the word scam in their title or summary tagged for

SELECT
  fi.feed_item_id,
  pubdate,
  link,
  title,
  summary,
  author,
  feed_id,
  likes,
  dislikes,
  love,
  anger,
  tags 
FROM
  feed_items fi 
  LEFT JOIN
    feed_item_tags t 
    ON fi.feed_item_id = t.feed_item_id 
  LEFT JOIN
    feed_item_love_anger_aggregate bba 
    ON fi.feed_item_id = bba.feed_item_id 
  LEFT JOIN
    feed_item_likes_dislikes_aggregate lda 
    ON fi.feed_item_id = lda.feed_item_id 
WHERE
  (
    title_vector @@ to_tsquery('scam') 
    OR summary_vector @@ to_tsquery('scam')
  )
  AND 'for' = ANY(tags) 
  AND likes > 0 
ORDER BY
  pubdate DESC,
  feed_item_id DESC LIMIT 3;

EXPLAIN ANALYZE Page 1

 Limit  (cost=2.83..16.88 rows=3 width=233) (actual time=0.075..0.158 rows=3 loops=1)
   ->  Nested Loop Left Join  (cost=2.83..124.53 rows=26 width=233) (actual time=0.074..0.157 rows=3 loops=1)
         ->  Nested Loop  (cost=2.69..116.00 rows=26 width=217) (actual time=0.067..0.146 rows=3 loops=1)
               Join Filter: (t.feed_item_id = fi.feed_item_id)
               Rows Removed by Join Filter: 73
               ->  Index Scan using idx_feed_items_pubdate_feed_item_id_desc on feed_items fi  (cost=0.14..68.77 rows=76 width=62) (actual time=0.016..0.023 rows=3 loops=1)
                     Filter: ((title_vector @@ to_tsquery('scam'::text)) OR (summary_vector @@ to_tsquery('scam'::text)))
                     Rows Removed by Filter: 1
               ->  Materialize  (cost=2.55..8.56 rows=34 width=187) (actual time=0.016..0.037 rows=25 loops=3)
                     ->  Hash Join  (cost=2.55..8.39 rows=34 width=187) (actual time=0.044..0.091 rows=36 loops=1)
                           Hash Cond: (t.feed_item_id = lda.feed_item_id)
                           ->  Seq Scan on feed_item_tags t  (cost=0.00..5.25 rows=67 width=155) (actual time=0.009..0.043 rows=67 loops=1)
                                 Filter: ('for'::text = ANY ((tags)::text[]))
                                 Rows Removed by Filter: 33
                           ->  Hash  (cost=1.93..1.93 rows=50 width=32) (actual time=0.029..0.029 rows=50 loops=1)
                                 Buckets: 1024  Batches: 1  Memory Usage: 12kB
                                 ->  Seq Scan on feed_item_likes_dislikes_aggregate lda  (cost=0.00..1.93 rows=50 width=32) (actual time=0.004..0.013 rows=50 loops=1)
                                       Filter: (likes > 0)
                                       Rows Removed by Filter: 24
         ->  Index Scan using idx_feed_item_love_anger_aggregate on feed_item_love_anger_aggregate bba  (cost=0.14..0.32 rows=1 width=32) (actual time=0.002..0.003 rows=0 loops=3)
               Index Cond: (feed_item_id = fi.feed_item_id)
 Planning Time: 0.601 ms
 Execution Time: 0.195 ms
(23 rows)

It is doing a sequential scan 2 times despite having the appropriate indices on all tables

My Page N Query

Take the published date and feed_item_id of the 3rd result from the above query and load the next 3 results

SELECT
  fi.feed_item_id,
  pubdate,
  link,
  title,
  summary,
  author,
  feed_id,
  likes,
  dislikes,
  love,
  anger,
  tags 
FROM
  feed_items fi 
  LEFT JOIN
    feed_item_tags t 
    ON fi.feed_item_id = t.feed_item_id 
  LEFT JOIN
    feed_item_love_anger_aggregate bba 
    ON fi.feed_item_id = bba.feed_item_id 
  LEFT JOIN
    feed_item_likes_dislikes_aggregate lda 
    ON fi.feed_item_id = lda.feed_item_id 
WHERE
  (
    pubdate,
    fi.feed_item_id
  )
  < ('2020-06-19 19:50:00+05:30', 'bc5c8dfe-13a9-d97a-a328-0e5b8990c500') 
  AND 
  (
    title_vector @@ to_tsquery('scam') 
    OR summary_vector @@ to_tsquery('scam')
  )
  AND 'for' = ANY(tags) 
  AND likes > 0 
ORDER BY
  pubdate DESC,
  feed_item_id DESC LIMIT 3;

Explain Page N Query Despite filtering it is doing 2 sequential scans

 Limit  (cost=2.83..17.13 rows=3 width=233) (actual time=0.082..0.199 rows=3 loops=1)
   ->  Nested Loop Left Join  (cost=2.83..121.97 rows=25 width=233) (actual time=0.081..0.198 rows=3 loops=1)
         ->  Nested Loop  (cost=2.69..113.67 rows=25 width=217) (actual time=0.073..0.185 rows=3 loops=1)
               Join Filter: (t.feed_item_id = fi.feed_item_id)
               Rows Removed by Join Filter: 183
               ->  Index Scan using idx_feed_items_pubdate_feed_item_id_desc on feed_items fi  (cost=0.14..67.45 rows=74 width=62) (actual time=0.014..0.034 rows=6 loops=1)
                     Index Cond: (ROW(pubdate, feed_item_id) < ROW('2020-06-19 19:50:00+05:30'::timestamp with time zone, 'bc5c8dfe-13a9-d97a-a328-0e5b8990c500'::uuid))
                     Filter: ((title_vector @@ to_tsquery('scam'::text)) OR (summary_vector @@ to_tsquery('scam'::text)))
                     Rows Removed by Filter: 2
               ->  Materialize  (cost=2.55..8.56 rows=34 width=187) (actual time=0.009..0.022 rows=31 loops=6)
                     ->  Hash Join  (cost=2.55..8.39 rows=34 width=187) (actual time=0.050..0.098 rows=36 loops=1)
                           Hash Cond: (t.feed_item_id = lda.feed_item_id)
                           ->  Seq Scan on feed_item_tags t  (cost=0.00..5.25 rows=67 width=155) (actual time=0.009..0.044 rows=67 loops=1)
                                 Filter: ('for'::text = ANY ((tags)::text[]))
                                 Rows Removed by Filter: 33
                           ->  Hash  (cost=1.93..1.93 rows=50 width=32) (actual time=0.028..0.029 rows=50 loops=1)
                                 Buckets: 1024  Batches: 1  Memory Usage: 12kB
                                 ->  Seq Scan on feed_item_likes_dislikes_aggregate lda  (cost=0.00..1.93 rows=50 width=32) (actual time=0.005..0.014 rows=50 loops=1)
                                       Filter: (likes > 0)
                                       Rows Removed by Filter: 24
         ->  Index Scan using idx_feed_item_love_anger_aggregate on feed_item_love_anger_aggregate bba  (cost=0.14..0.32 rows=1 width=32) (actual time=0.003..0.003 rows=1 loops=3)
               Index Cond: (feed_item_id = fi.feed_item_id)
 Planning Time: 0.596 ms
 Execution Time: 0.236 ms
(24 rows)

LINK TO THE FIDDLE

I have setup a fiddle with the required tables and indexes, could someone tell me how to fix the query to use an Index Scan at best or cut down the number of Sequential Scans to 1?

jjanes · Accepted Answer

You currently have no other index on the tags table than the GIN index. In your fiddle, if I create index on feed_item_tags (feed_item_id) and do an ANALYZE, then both seq scans go away. It is probably better to do it this way then it is to reformulate so that it can use the GIN index, like my other answer, because this way makes more efficient use of the prospect of stopping early with the LIMIT.

But really, what is the point of "feed_item_tags" table? If you are going to have a child table to list tags, you would usually have one tag/parent_id combination per row. If you want an array of tags instead of a column of them, why not just stick the array directly into the parent table? There are sometimes reasons to have tables with 1:1 relationship between two tables, but not very often.

How do I make postgres avoid doing a double sequential scan for this seek pagination query?

Answers (2)

Related Questions