TheFooProgrammer

Reputation: 2517

Any ideas why this join query is performing a table scan, and whether there's a way I can eliminate it?

Given the following two table structures:

posts table:

   Column   |            Type             |                      Modifiers                      
------------+-----------------------------+-----------------------------------------------------
 id         | integer                     | not null default nextval('posts_id_seq'::regclass)
 body       | character varying(1000)     | 
 person_id  | integer                     | 
 created_at | timestamp without time zone | 
 updated_at | timestamp without time zone | 
 slug       | character varying(255)      | 

Indexes:

    "posts_pkey" PRIMARY KEY, btree (id)
    "index_on_slug" UNIQUE, btree (slug)
    "index_on_body" btree (body)
    "index_on_created_at" btree (created_at)
    "index_on_updated_at" btree (updated_at)

posts_topics table:

  Column  |  Type   |                         Modifiers                          
----------+---------+------------------------------------------------------------
 id       | integer | not null default nextval('posts_topics_id_seq'::regclass)
 quote_id | integer | 
 topic_id | integer | 

Indexes:

    "posts_topics_pkey" PRIMARY KEY, btree (id)
    "index_posts_topics_on_topic_id_and_quote_id" UNIQUE, btree (topic_id, quote_id)
    "index_posts_topics_on_quote_id" btree (quote_id)
    "index_posts_topics_on_topic_id" btree (topic_id)

The following query:

SELECT "posts".* FROM "posts" 
INNER JOIN "posts_topics" ON "posts"."id" = "posts_topics"."quote_id" 
WHERE "posts_topics"."topic_id" = 297 
ORDER BY "posts"."updated_at" ASC 
LIMIT 10 OFFSET 16340;

Leads to the following query plan:

 Limit  (cost=65299.69..65299.72 rows=10 width=219) (actual time=768.913..768.914 rows=10 loops=1)
   ->  Sort  (cost=65258.84..65301.95 rows=17243 width=219) (actual time=762.651..768.167 rows=16350 loops=1)
         Sort Key: posts.updated_at
         Sort Method: external merge  Disk: 4664kB
         ->  Hash Join  (cost=30177.21..62214.98 rows=17243 width=219) (actual time=290.098..738.999 rows=17589 loops=1)
               Hash Cond: (posts_topics.quote_id = posts.id)
               ->  Bitmap Heap Scan on posts_topics  (cost=326.06..21967.63 rows=17243 width=4) (actual time=4.343..22.194 rows=17589 loops=1)
                     Recheck Cond: (topic_id = 297)
                     ->  Bitmap Index Scan on index_posts_topics_on_topic_id  (cost=0.00..321.75 rows=17243 width=0) (actual time=2.400..2.400 rows=17589 loops=1)
                           Index Cond: (topic_id = 297)
               ->  Hash  (cost=15750.51..15750.51 rows=329651 width=219) (actual time=280.392..280.392 rows=329651 loops=1)
                     Buckets: 1024  Batches: 128  Memory Usage: 679kB
                     ->  Seq Scan on posts  (cost=0.00..15750.51 rows=329651 width=219) (actual time=0.003..95.573 rows=329651 loops=1)

View this plan on http://explain.depesz.com.

As can be seen, the Hash Join leads to a Seq Scan on posts, which has 329651 rows, and the Hash Join part of the query alone takes about 738ms.

As both posts_topics.quote_id and posts.id are indexed, I don't understand why a Seq Scan is being performed on posts. Any ideas why?

Also is there a way I can eliminate it?

Update 1

As suggested by @IgorRomanchenko and @a_horse_with_no_name, I increased work_mem to 128MB, which improved the query run time from the original ~780ms down to ~260ms.
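
For reference, a session-level change like the one described here would look roughly like this (128MB is the value from this update; treat it as a starting point rather than a recommendation):

-- Raise work_mem for the current session only; the value is the one from Update 1.
SET work_mem = '128MB';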

Upvotes: 1

Views: 1540

Answers (2)

Craig Ringer

Reputation: 325051

Stats

Your row estimates look bang on, so this doesn't look like a case of bad table statistics or bad estimation. See your plan on explain.depesz.com - note that there are no big mismatches between actual and estimated rows.

Cost parameters

With questions like this, it often turns out that Pg thinks your random I/O is more expensive than it really is, so it's preferring a scan and hash join to an index scan.

Try significantly lowering random_page_cost. Also make sure effective_cache_size reflects your system accurately.
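
A sketch of what that tuning might look like (the values below are assumptions, not recommendations; they would suit a box whose working set is mostly cached or sits on fast storage):

-- Illustrative values only; the default random_page_cost is 4.0.
SET random_page_cost = 1.5;
-- Roughly shared_buffers plus the OS page cache available to Postgres.
SET effective_cache_size = '4GB';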

Version

Are you on Pg 9.2 or newer? If not, upgrade. There are plenty of performance improvements. For just one example, index-only scans could be quite useful for a query like this.
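
You can check which version you're running with, for example:

SHOW server_version;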

Range queries should avoid OFFSET where possible

This is exceedingly bad for performance:

LIMIT 10 OFFSET 16340;

Huge limit/offset queries can easily perform quite miserably. PostgreSQL must generate, and discard, all the prior 16340 rows.

If at all possible re-phrase the query to use a filter range. In this case you're doing ORDER BY "posts"."updated_at" ASC, so you should be able to write something like:

WHERE "posts_topics"."topic_id" = 297 
  AND posts.updated_at > ?

and pass the greatest updated_at from the last set of rows you retrieved as the input.
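
A sketch of the full keyset-style query (the timestamp literal is a placeholder for the greatest updated_at from the previous page; this assumes updated_at is, for your purposes, unique enough to paginate on):

SELECT "posts".* FROM "posts"
INNER JOIN "posts_topics" ON "posts"."id" = "posts_topics"."quote_id"
WHERE "posts_topics"."topic_id" = 297
  AND "posts"."updated_at" > '2014-01-01 00:00:00'  -- placeholder: last updated_at seen so far
ORDER BY "posts"."updated_at" ASC
LIMIT 10;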

Force multiple passes

If you're on 9.2 with index-only scans, it's possible that you may be able to benefit from a two-pass approach: determine which post IDs are of interest, then do another scan to get the rest of the post data for those IDs. That way you aren't generating and then throwing away full rows for the other 16,000-odd posts.

-- Pass 1 (inner query): find just the ids of the 10 wanted posts.
-- Pass 2 (outer query): fetch the full rows for those ids only.
SELECT p2.*
FROM (
  SELECT posts.id
  FROM posts
  WHERE EXISTS (
    SELECT 1
    FROM posts_topics pt
    WHERE pt.quote_id = posts.id
      AND pt.topic_id = 297
  )
  ORDER BY posts.updated_at ASC
  LIMIT 10
  OFFSET 16340
) wanted_posts
INNER JOIN posts p2 ON wanted_posts.id = p2.id
ORDER BY p2.updated_at ASC;  -- the join does not preserve the inner ordering

This almost certainly won't do any good unless you're able to use an index-only scan on posts_topics and/or posts. Otherwise you're paying most of the cost of the heap lookups for each discarded row anyway.

I'm a bit dubious about whether that'll be possible because of the need to join, then sort and discard. A composite index on posts_topics(topic_id, quote_id) and on posts(id, updated_at) (or opposite ordering) might prove useful, but I'm not sure and don't have the time to delve in with test cases or poking at the guts of the query planner code at present.
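
For what it's worth, the composite index on posts_topics(topic_id, quote_id) already exists per the question, so the only new index to experiment with would be something like the following (the name and column order are just one possible choice):

-- Hypothetical index to help Postgres walk posts in updated_at order
-- (or satisfy an index-only scan of (updated_at, id)); test both orderings.
CREATE INDEX index_posts_on_updated_at_and_id ON posts (updated_at, id);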

Upvotes: 2

Denis de Bernardy

Reputation: 78561

As I read your plan and the stats, it is the best plan you can possibly get...

The issue with your query is the huge offset, which basically amounts to fetching 10 rows from near the end of the set that the query would yield without the limit/offset. What's more, Postgres anticipates precisely this with its row estimates.

Think of such a query as fetching the top 16350 rows of a set and eliminating all of them except the last 10. This is way beyond "use an index" territory in your case, since it basically amounts to fetching the entire set (and a significant part of the table, to boot). If you did an index scan for that with your cardinality, you'd basically be doing random disk accesses across the plan (and, de facto, the entire table), and that would be even slower; better to grab the whole thing sequentially, sort it, and be done with it.

You might achieve an index-only scan in recent Postgres versions (provided all columns involved are in indexes), but it wouldn't be significantly faster: you'd still end up reading and hash joining the entire index anyway...

Edit: that said, as pointed out by @a_horse_with_no_name, there is some writing to disk involved that you could avoid by increasing work_mem. Increasing it might make the hash join and sort in your plan materially faster.

Upvotes: 2
