TheFooProgrammer

Reputation: 2517

Any ideas why this join query is performing a table scan, and whether there's a way I can eliminate it?

Given the following two table structures:

posts table:

   Column   |            Type             |                      Modifiers                      
------------+-----------------------------+-----------------------------------------------------
 id         | integer                     | not null default nextval('posts_id_seq'::regclass)
 body       | character varying(1000)     | 
 person_id  | integer                     | 
 created_at | timestamp without time zone | 
 updated_at | timestamp without time zone | 
 slug       | character varying(255)      | 

Indexes:

    "posts_pkey" PRIMARY KEY, btree (id)
    "index_on_slug" UNIQUE, btree (slug)
    "index_on_body" btree (body)
    "index_on_created_at" btree (created_at)
    "index_on_updated_at" btree (updated_at)

posts_topics table:

  Column  |  Type   |                         Modifiers                          
----------+---------+------------------------------------------------------------
 id       | integer | not null default nextval('posts_topics_id_seq'::regclass)
 quote_id | integer | 
 topic_id | integer | 

Indexes:

    "posts_topics_pkey" PRIMARY KEY, btree (id)
    "index_posts_topics_on_topic_id_and_quote_id" UNIQUE, btree (topic_id, quote_id)
    "index_posts_topics_on_quote_id" btree (quote_id)
    "index_posts_topics_on_topic_id" btree (topic_id)

The following query:

SELECT "posts".* FROM "posts" 
INNER JOIN "posts_topics" ON "posts"."id" = "posts_topics"."quote_id" 
WHERE "posts_topics"."topic_id" = 297 
ORDER BY "posts"."updated_at" ASC 
LIMIT 10 OFFSET 16340;

Leads to the following query plan:

 Limit  (cost=65299.69..65299.72 rows=10 width=219) (actual time=768.913..768.914 rows=10 loops=1)
   ->  Sort  (cost=65258.84..65301.95 rows=17243 width=219) (actual time=762.651..768.167 rows=16350 loops=1)
         Sort Key: posts.updated_at
         Sort Method: external merge  Disk: 4664kB
         ->  Hash Join  (cost=30177.21..62214.98 rows=17243 width=219) (actual time=290.098..738.999 rows=17589 loops=1)
               Hash Cond: (posts_topics.quote_id = posts.id)
               ->  Bitmap Heap Scan on posts_topics  (cost=326.06..21967.63 rows=17243 width=4) (actual time=4.343..22.194 rows=17589 loops=1)
                     Recheck Cond: (topic_id = 297)
                     ->  Bitmap Index Scan on index_posts_topics_on_topic_id  (cost=0.00..321.75 rows=17243 width=0) (actual time=2.400..2.400 rows=17589 loops=1)
                           Index Cond: (topic_id = 297)
               ->  Hash  (cost=15750.51..15750.51 rows=329651 width=219) (actual time=280.392..280.392 rows=329651 loops=1)
                     Buckets: 1024  Batches: 128  Memory Usage: 679kB
                     ->  Seq Scan on posts  (cost=0.00..15750.51 rows=329651 width=219) (actual time=0.003..95.573 rows=329651 loops=1)

View this plan on http://explain.depesz.com.

As can be seen, the Hash Join leads to a Seq Scan on posts, which has 329651 rows, and the Hash Join part of the query alone takes about 738ms.

As both posts_topics.quote_id and posts.id are indexed, I don't understand why a Seq Scan is being performed on posts. Any ideas why?

Also is there a way I can eliminate it?

Update 1

As suggested by @IgorRomanchenko and @a_horse_with_no_name, I increased work_mem to 128MB, which improved the query run time from the original ~780ms down to ~260ms.
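
For reference, a session-level change like the one described here would look roughly like this (128MB is the value from this update; treat it as a starting point rather than a recommendation):

-- Raise work_mem for the current session only; the value is the one from Update 1.
SET work_mem = '128MB';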

Upvotes: 1

Views: 1540

Answers (2)

Craig Ringer

Reputation: 325051

Stats

Your row estimates look bang on, so this doesn't look like a case of bad table statistics or bad estimation. See your plan on explain.depesz.com - note that there are no big mismatches between actual and estimated rows.

Cost parameters

With questions like this, it often turns out that Pg thinks your random I/O is more expensive than it really is, so it's preferring a scan and hash join to an index scan.

Try significantly lowering random_page_cost. Also make sure effective_cache_size reflects your system accurately.
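
A sketch of what that tuning might look like (the values below are assumptions, not recommendations; they would suit a box whose working set is mostly cached or sits on fast storage):

-- Illustrative values only; the default random_page_cost is 4.0.
SET random_page_cost = 1.5;
-- Roughly shared_buffers plus the OS page cache available to Postgres.
SET effective_cache_size = '4GB';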

Version

Are you on Pg 9.2 or newer? If not, upgrade. There are plenty of performance improvements. For just one example, index-only scans could be quite useful for a query like this.
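
You can check which version you're running with, for example:

SHOW server_version;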

Range queries should avoid OFFSET where possible

This is exceedingly bad for performance:

LIMIT 10 OFFSET 16340;

Huge limit/offset queries can easily perform quite miserably. PostgreSQL must generate, and discard, all the prior 16340 rows.

If at all possible re-phrase the query to use a filter range. In this case you're doing ORDER BY "posts"."updated_at" ASC, so you should be able to write something like:

WHERE "posts_topics"."topic_id" = 297 
  AND posts.updated_at > ?

and pass the greatest updated_at from the last set of rows you retrieved as the input.
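
A sketch of the full keyset-style query (the timestamp literal is a placeholder for the greatest updated_at from the previous page; this assumes updated_at is, for your purposes, unique enough to paginate on):

SELECT "posts".* FROM "posts"
INNER JOIN "posts_topics" ON "posts"."id" = "posts_topics"."quote_id"
WHERE "posts_topics"."topic_id" = 297
  AND "posts"."updated_at" > '2014-01-01 00:00:00'  -- placeholder: last updated_at seen so far
ORDER BY "posts"."updated_at" ASC
LIMIT 10;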

Force multiple passes

If you're on 9.2 with index-only scans, it's possible that you may be able to benefit from a two-pass approach: determine which post IDs are of interest, then do another scan to get the rest of the post data for those IDs. That way you aren't generating and then throwing away full rows for the other 16,000-odd posts.

-- Pass 1 (inner query): find just the ids of the 10 wanted posts.
-- Pass 2 (outer query): fetch the full rows for those ids only.
SELECT p2.*
FROM (
  SELECT posts.id
  FROM posts
  WHERE EXISTS (
    SELECT 1
    FROM posts_topics pt
    WHERE pt.quote_id = posts.id
      AND pt.topic_id = 297
  )
  ORDER BY posts.updated_at ASC
  LIMIT 10
  OFFSET 16340
) wanted_posts
INNER JOIN posts p2 ON wanted_posts.id = p2.id
ORDER BY p2.updated_at ASC;  -- the join does not preserve the inner ordering

This almost certainly won't do any good unless you're able to use an index-only scan on posts_topics and/or posts. Otherwise you're paying most of the cost of the heap lookups for each discarded row anyway.

I'm a bit dubious about whether that'll be possible because of the need to join, then sort and discard. A composite index on posts_topics(topic_id, quote_id) and on posts(id, updated_at) (or opposite ordering) might prove useful, but I'm not sure and don't have the time to delve in with test cases or poking at the guts of the query planner code at present.
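
For what it's worth, the composite index on posts_topics(topic_id, quote_id) already exists per the question, so the only new index to experiment with would be something like the following (the name and column order are just one possible choice):

-- Hypothetical index to help Postgres walk posts in updated_at order
-- (or satisfy an index-only scan of (updated_at, id)); test both orderings.
CREATE INDEX index_posts_on_updated_at_and_id ON posts (updated_at, id);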

Upvotes: 2

Denis de Bernardy

Reputation: 78561

As I read your plan and the stats, it is the best plan you can possibly get...

The issue with your query is the huge offset, which basically amounts to fetching 10 rows from near the end of the set that the query would yield without the limit/offset. What's more, Postgres anticipates precisely this with its row estimates.

Think of such a query as fetching the top 16350 rows of a set and eliminating all of them except the last 10. This is way beyond "use an index" territory in your case, since it basically amounts to fetching the entire set (and a significant part of the table, to boot). If you did an index scan for that with your cardinality, you'd basically be doing random disk accesses across the plan (and, de facto, the entire table), and that would be even slower; better to grab the whole thing sequentially, sort it, and be done with it.

You might achieve an index-only scan in recent Postgres versions (provided all columns involved are in indexes), but it wouldn't be significantly faster: you'd still end up reading and hash joining the entire index anyway...

Edit: that said, as pointed out by @a_horse_with_no_name, there is some writing to disk involved that you could avoid by increasing work_mem. Increasing it might make the hash join and sort in your plan materially faster.

Upvotes: 2
