tws
tws

Reputation: 707

How to reduce index scan time for a postgresql IN query with ordered results?

I built a simple app for reading RSS feeds using rails and postgresql, but am running into performance issues when I try to query my feed_entries table for posts from more than one feed. An example query looks like this, to retrieve the 20 most recent entries for a given collection of feed ids:

SELECT * FROM feed_entries WHERE feed_id IN (19, 21, 383, 1867, 3103) ORDER BY published_at DESC LIMIT 20;

The feed_entries table has about 4 million rows in it, is hosted on Heroku Postgres with the Fugu plan, and it has a few indexes, including:

"index_feed_entries_on_feed_id_and_published_at" btree (feed_id, published_at)
"index_feed_entries_on_published_at" btree (published_at)

Here are the results of the query planner:

EXPLAIN ANALYZE SELECT * FROM feed_entries WHERE feed_id IN (19, 21, 383, 1867, 3103) ORDER BY published_at DESC LIMIT 20;

 Limit  (cost=4353.93..4353.94 rows=20 width=1016) (actual time=12172.275..12172.325 rows=20 loops=1)
   ->  Sort  (cost=4353.93..4355.07 rows=2286 width=1016) (actual time=12172.268..12172.284 rows=20 loops=1)
     Sort Key: published_at
     Sort Method: top-N heapsort  Memory: 52kB
     ->  Index Scan using index_feed_entries_on_feed_id_and_published_at on feed_entries  (cost=0.00..4341.76 rows=2286 width=1016) (actual time=8.612..12169.504 rows=630 loops=1)
           Index Cond: (feed_id = ANY ('{19,21,383,1867,3103}'::integer[]))
Total runtime: 12172.520 ms

The planner looks like it's using the appropriate index, yet scanning the index still takes ~12 seconds, which strikes me as too long for a table that has 4 million rows. If I repeat the query planner exactly as above, then the second time it tells me that the whole thing takes only 2 ms, maybe that's simply because the results of the first query are cached, but it's still confusing to me. I also tried running VACUUM ANALYZE before running the query, but it made little difference. Additionally, if I query the table for a single feed_id, then the query planner uses an Index Scan Backward using index_feed_entries_on_feed_id_and_published_at on feed_entries, and total execution time is much faster, on the order of 20ms.

Are there other strategies I could adopt to optimize the performance of this relatively simple IN query?

Upvotes: 2

Views: 1857

Answers (2)

Erwin Brandstetter
Erwin Brandstetter

Reputation: 656706

Another thing to try would be this alternative query form:

SELECT * 
FROM   feed_entries
JOIN  (unnest('{19,21,383,1867,3103}'::int[]) AS feed_id) sub USING (feed_id)
ORDER  BY published_at DESC
LIMIT  20;

Sort order of columns does matter in multi-column indexes, though. Use:

CREATE index_feed_entries_2 ON feed_entries (feed_id, published_at DESC)

If you CLUSTER your table according to this index, this might give you another little boost, but effectiveness deteriorates with a lot of updates. Read the last chapter of this related answer for more info:
Bitmap Heap Scan performance

Of course, all the usual advice on performance optimization applies, too.

Upvotes: 2

bma
bma

Reputation: 9756

Try creating an index with a DESC order. Eg.

create index feed_entries_published_at_desc_idx on feed_entries ( published_at desc ) with (fillfactor=100);

You could try a similar (compound) index as above on (feed_id, published_at desc) to see how that works too.

Upvotes: 1

Related Questions