Nathan Long
Nathan Long

Reputation: 125942

How can I find close-together points in a PostgreSQL database?

I've got a PostgreSQL table of geographic nodes, each of which has a latitude and longitude. If any two nodes are less than 1 km apart, there's a good chance that one of them was created erroneously.

How can I detect this?

An Extremely Slow Solution

The "brute force" way is basically, "for each node, examine each other node and see how far it is from the first one. If it's less than 1 km, add the pair to a list."

An SQL implementation would be:

SELECT
  n1.id AS n1_id, n1.latitude AS n1_lat, n1.longitude AS n1_long,
  n2.id AS n2_id, n2.latitude AS n2_lat, n2.longitude AS n2_long
FROM
  nodes n1
INNER JOIN nodes n2 ON (
  n1.id != n2.id
  AND earth_distance(
    ll_to_earth(n1.latitude, n1.longitude),
    ll_to_earth(n2.latitude, n2.longitude)
  ) < 1000
)

This is horribly slow; for N nodes, there are roughly N2 pairs to examine. On my database, it's infeasible to do this.

Upvotes: 1

Views: 183

Answers (2)

Bohemian
Bohemian

Reputation: 425043

To find all nodes without a nearby other node, use an outer self-join that filters out matches for near nodes found:

SELECT n1.*
FROM nodes n1
LEFT JOIN nodes n2 ON n1.id != n2.id
    AND earth_distance(
        ll_to_earth(n1.latitude, n1.longitude),
        ll_to_earth(n2.latitude, n2.longitude)
    ) < 1000
WHERE n2.id IS NULL

That should run in an acceptable time. If not, you could optimuze the query considerably by pre-calculating all the ll_to_earth() values once for each node using a constant table expression:

WITH cte AS (
    SELECT *, ll_to_earth(latitude, longitude) earth
    FROM nodes
)
SELECT n1.*
FROM cte n1
LEFT JOIN cte n2 ON n1.id != n2.id
    AND earth_distance(n1.earth, n2.earth) < 1000
WHERE n2.id IS NULL

That should run quite a bit faster.

Upvotes: 1

Christian
Christian

Reputation: 7320

Why don't you restrict your join to only "nearby" nodes?

Create 2 indexes on table nodes:

CREATE INDEX I_NODES_LAT ON NODES (LATITUDE, ID);
CREATE INDEX I_NODES_LON ON NODES (LONGITUDE, ID);
ANALYZE NODES;

Second, try this way:

SELECT
  n1.id AS n1_id, n1.latitude AS n1_lat, n1.longitude AS n1_long,
  n2.id AS n2_id, n2.latitude AS n2_lat, n2.longitude AS n2_long
FROM
  nodes n1
INNER JOIN nodes n2 ON  n1.id != n2.id

-- the trick is to get only nearby nodes. 
-- You can mess with "0.1" to restrict even more... It is just as guess...
AND N2.latitude  between (N1.latitude - 0.1)  and (N1.latitude + 0.1)
AND N2.longitude between (N1.longitude - 0.1) and (N1.longitude + 0.1) 

WHERE
  earth_distance
  (
    ll_to_earth(n1.latitude, n1.longitude),
    ll_to_earth(n2.latitude, n2.longitude)
  ) < 1000

Upvotes: 1

Related Questions