Reputation: 125942
I've got a PostgreSQL table of geographic nodes
, each of which has a latitude
and longitude
. If any two nodes are less than 1 km apart, there's a good chance that one of them was created erroneously.
How can I detect this?
The "brute force" way is basically, "for each node, examine each other node and see how far it is from the first one. If it's less than 1 km, add the pair to a list."
An SQL implementation would be:
SELECT
n1.id AS n1_id, n1.latitude AS n1_lat, n1.longitude AS n1_long,
n2.id AS n2_id, n2.latitude AS n2_lat, n2.longitude AS n2_long
FROM
nodes n1
INNER JOIN nodes n2 ON (
n1.id != n2.id
AND earth_distance(
ll_to_earth(n1.latitude, n1.longitude),
ll_to_earth(n2.latitude, n2.longitude)
) < 1000
)
This is horribly slow; for N nodes, there are roughly N2 pairs to examine. On my database, it's infeasible to do this.
Upvotes: 1
Views: 183
Reputation: 425043
To find all nodes without a nearby other node, use an outer self-join that filters out matches for near nodes found:
SELECT n1.*
FROM nodes n1
LEFT JOIN nodes n2 ON n1.id != n2.id
AND earth_distance(
ll_to_earth(n1.latitude, n1.longitude),
ll_to_earth(n2.latitude, n2.longitude)
) < 1000
WHERE n2.id IS NULL
That should run in an acceptable time. If not, you could optimuze the query considerably by pre-calculating all the ll_to_earth()
values once for each node using a constant table expression:
WITH cte AS (
SELECT *, ll_to_earth(latitude, longitude) earth
FROM nodes
)
SELECT n1.*
FROM cte n1
LEFT JOIN cte n2 ON n1.id != n2.id
AND earth_distance(n1.earth, n2.earth) < 1000
WHERE n2.id IS NULL
That should run quite a bit faster.
Upvotes: 1
Reputation: 7320
Why don't you restrict your join to only "nearby" nodes?
Create 2 indexes on table nodes
:
CREATE INDEX I_NODES_LAT ON NODES (LATITUDE, ID);
CREATE INDEX I_NODES_LON ON NODES (LONGITUDE, ID);
ANALYZE NODES;
Second, try this way:
SELECT
n1.id AS n1_id, n1.latitude AS n1_lat, n1.longitude AS n1_long,
n2.id AS n2_id, n2.latitude AS n2_lat, n2.longitude AS n2_long
FROM
nodes n1
INNER JOIN nodes n2 ON n1.id != n2.id
-- the trick is to get only nearby nodes.
-- You can mess with "0.1" to restrict even more... It is just as guess...
AND N2.latitude between (N1.latitude - 0.1) and (N1.latitude + 0.1)
AND N2.longitude between (N1.longitude - 0.1) and (N1.longitude + 0.1)
WHERE
earth_distance
(
ll_to_earth(n1.latitude, n1.longitude),
ll_to_earth(n2.latitude, n2.longitude)
) < 1000
Upvotes: 1