Reputation: 707
I built a simple app for reading RSS feeds using Rails and PostgreSQL, but I am running into performance issues when I query my feed_entries table for posts from more than one feed. For example, this query retrieves the 20 most recent entries for a given collection of feed ids:
SELECT * FROM feed_entries WHERE feed_id IN (19, 21, 383, 1867, 3103) ORDER BY published_at DESC LIMIT 20;
The feed_entries table has about 4 million rows, is hosted on Heroku Postgres (Fugu plan), and has a few indexes, including:
"index_feed_entries_on_feed_id_and_published_at" btree (feed_id, published_at)
"index_feed_entries_on_published_at" btree (published_at)
Here are the results of the query planner:
EXPLAIN ANALYZE SELECT * FROM feed_entries WHERE feed_id IN (19, 21, 383, 1867, 3103) ORDER BY published_at DESC LIMIT 20;
Limit (cost=4353.93..4353.94 rows=20 width=1016) (actual time=12172.275..12172.325 rows=20 loops=1)
-> Sort (cost=4353.93..4355.07 rows=2286 width=1016) (actual time=12172.268..12172.284 rows=20 loops=1)
Sort Key: published_at
Sort Method: top-N heapsort Memory: 52kB
-> Index Scan using index_feed_entries_on_feed_id_and_published_at on feed_entries (cost=0.00..4341.76 rows=2286 width=1016) (actual time=8.612..12169.504 rows=630 loops=1)
Index Cond: (feed_id = ANY ('{19,21,383,1867,3103}'::integer[]))
Total runtime: 12172.520 ms
The planner appears to be using the appropriate index, yet scanning that index still takes ~12 seconds, which strikes me as too long for a table with 4 million rows. If I repeat the exact same EXPLAIN ANALYZE, the second run reports only 2 ms; presumably that's because the first query warmed the cache, but the initial slowness still confuses me. I also tried running VACUUM ANALYZE before the query, but it made little difference. Additionally, if I query the table for a single feed_id, the planner uses an Index Scan Backward using index_feed_entries_on_feed_id_and_published_at, and total execution time is much faster, on the order of 20 ms.
Are there other strategies I could adopt to optimize the performance of this relatively simple IN query?
Upvotes: 2
Views: 1857
Reputation: 656706
Another thing to try would be this alternative query form:
SELECT *
FROM feed_entries
JOIN unnest('{19,21,383,1867,3103}'::int[]) AS sub(feed_id) USING (feed_id)
ORDER BY published_at DESC
LIMIT 20;
The sort order of columns does matter in multi-column indexes, though. Use:
CREATE INDEX index_feed_entries_2 ON feed_entries (feed_id, published_at DESC);
If you CLUSTER your table according to this index, that might give you another little boost, but effectiveness deteriorates with lots of updates. Read the last chapter of this related answer for more info:
Bitmap Heap Scan performance
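A minimal sketch of that CLUSTER step, assuming the index index_feed_entries_2 from above exists (CLUSTER rewrites the table once, takes an exclusive lock while it runs, and the ordering is not maintained for later writes):

```sql
-- One-time physical rewrite of the table in index order
-- (takes an ACCESS EXCLUSIVE lock for the duration)
CLUSTER feed_entries USING index_feed_entries_2;

-- Refresh planner statistics after the rewrite
ANALYZE feed_entries;
```

Subsequent inserts and updates are not kept in index order, so on a write-heavy table you would have to re-run CLUSTER periodically for the benefit to persist.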
Of course, all the usual advice on performance optimization applies, too.
Upvotes: 2
Reputation: 9756
Try creating an index with a DESC sort order, e.g.:
create index feed_entries_published_at_desc_idx on feed_entries ( published_at desc ) with (fillfactor=100);
You could also try a similar compound index on (feed_id, published_at desc) to see how that performs.
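A sketch of that compound variant (the index name is illustrative; fillfactor=100 packs index pages full, which suits read-mostly tables but leaves no room for updates):

```sql
create index feed_entries_feed_id_published_at_desc_idx
    on feed_entries (feed_id, published_at desc)
    with (fillfactor = 100);
```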
Upvotes: 1