Chron Bag
Chron Bag

Reputation: 587

Simple SQL query that filters by geographic distance is very slow

Here's my query:

SELECT 1 
FROM post po
WHERE ST_DWithin(po.geog, (SELECT geog FROM person WHERE person.person_id = $1), 20000 * 1609.34, false)
ORDER BY post_id DESC
LIMIT 5;

And here's the EXPLAIN ANALYZE:

enter image description here

I have an index on everything, so I'm not sure why this is slow. The first 5 posts when sorting by post_id DESC satisfy the clause, so shouldn't this return instantly?

I notice that if I replace the ST_DWithin call with an ST_Distance call instead, it runs instantly, like so:

SELECT 1 
FROM post po
WHERE ST_Distance(po.geog, (SELECT geog FROM person WHERE person.person_id = $1)) < 20000 * 1609.34
ORDER BY post_id DESC
LIMIT 5;

That one runs in .15 milliseconds. So, simple solution is to just replace the ST_DWithin call with the ST_Distance call, no?

Well, unfortunately not, because it's not always the first 5 rows that match. Sometimes it has to scan deep within the table, so at that point ST_DWithin is better because it can use the geographic index, while ST_Distance cannot.

I think this may be a problem of postgres' query planner messing up? Like, for some reason it thinks it needs to do a scan of the whole table, despite the ORDER BY x LIMIT 5 clause being front and center? Not sure..

Upvotes: 0

Views: 705

Answers (2)

JGH
JGH

Reputation: 17836

The distance you are using is almost the length of the equator, so you can expect (almost) all of your results to satisfy this clause.

As ST_DWithin makes use of a spatial index, the planner (wrongly) thinks it will be faster to use it to first filter out the rows. It then has to order (almost) all rows and at last will keep the first 5 ones.

When using st_distance, no spatial index can be used and the planner will pick a different plan, likely one relying on an index on post_id, which is blazing fast. But when the number of rows to be returned (the limit) increases, a different plan is used and the planner probably believe it would be again faster to compute the distance on all rows.

Upvotes: 3

jjanes
jjanes

Reputation: 44137

The first 5 posts when sorting by post_id DESC satisfy the clause, so shouldn't this return instantly?

This is a fact the system has no way of knowing ahead of time. It can't use unknown facts when planning the query. It thinks it will find only 10 rows. That means it thinks it would have to scan half the index on post_id before accumulating 5 rows (out of 10) which meet the geometry condition.

It actually finds 100,000 rows (an oddly round number). But it doesn't know that until after the fact.

If you were to first to a query on SELECT geog FROM person WHERE person.person_id = $1 and then write the result of that directly into your main query, rather than as a subquery, it might (or might not) do a better job of planning.

Upvotes: 1

Related Questions