Reputation: 829
I have a table in the following format:
id | lon | lat |
---|---|---|
1 | 10.111 | 20.415 |
2 | 10.099 | 30.132 |
3 | 10.110 | 20.414 |
I would like to create a new column that returns all IDs such that lon
and lat
are less then a tolerance value away, i.e. abs(lon_i - lon_j) < tol AND abs(lat_i-lat_j) < tol
for all i != j
.
My initial thought going about it is to create a temporary duplicate of the table and then join the table giving the names of the columns lon_2, lat_2, and then:
SELECT id
FROM TABLE_1
WHERE ABS(TABLE_1.lon - TABLE_1.lon_2) < tol
AND ABS(TABLE_1.lat - TABLE_1.lat_2) < tol
But there must be a better and more efficient way to approach it.
Upvotes: 2
Views: 688
Reputation: 656804
all IDs such that
lon
andlat
are less then a tolerance value away
A simple self-join as demonstrated by forpas can solve the task.
However, the simple query has to compute deltas for every combination of two rows before eliminating pairs outside the given tolerance. Basically a Cartesian Product, which scales terribly with O(N²). If your table isn't trivially small, keep reading.
The target is to get O(N) instead. We're going to use some advanced concepts:
LATERAL
joinIf you are unfamiliar, here is an introduction to Nearest-Neighbour Searching with PostGis (which may be in use for your geocodes?).
Either way, we can do the same with the type point
in stock Postgres. We need a GiST index, while sticking to your original original table, based on an expression:
CREATE INDEX tbl_point_gist_idx ON tbl USING GiST (point(lon, lat));
SELECT t.*, t1.id AS near_id, t1.lon AS near_lon, t1.lat AS near_lat
FROM tbl t
JOIN LATERAL (
SELECT *
FROM tbl t1
WHERE t1.id <> t.id
ORDER BY point(t.lon, t.lat) <-> point(t1.lon, t1.lat)
LIMIT 5 -- arbitrary
) t1 ON abs(t.lon - t1.lon) < 0.0011
AND abs(t.lat - t1.lat) < 0.0011;
id | lon | lat | near_id | near_lon | near_lat |
---|---|---|---|---|---|
1 | 10.111 | 20.415 | 3 | 10.110 | 20.414 |
3 | 10.110 | 20.414 | 1 | 10.111 | 20.415 |
db<>fiddle here
Note the LATERAL
join: another advanced concept. See:
Using the distance operator <->
to compute geometric distance between points.
A simple correlated subquery wouldn't do. We need to get the nearest neighbours before filtering for distance. This can use the GiST index very efficiently. Else, if a row has no qualifying neighbours, Postgres would have to search the whole table before giving up - which defies the purpose.
Problem: We are forced to use an arbitrary LIMIT
, 5 in the example. Additional neighbours within tolerance will be snubbed. We'd need to use the maximum number of possible neighbours, which is yet unknown. With few big outliers it would also be inefficient to overshoot for the rest.
Also, the basic query returns one row per neighbour. We want to aggregate to a single row with all neighbours.
Marry the above with another advanced concept: a recursive CTE:
WITH RECURSIVE d (tol) AS (SELECT float8 '0.001') -- input tolerance here
, cte AS (
SELECT t.id, t.lon, t.lat, d.tol
, 1 AS step, ARRAY [t1.id] AS neighbors
FROM tbl t
CROSS JOIN d
JOIN LATERAL (
SELECT *
FROM tbl t1
WHERE t1.id <> t.id
ORDER BY point(t.lon, t.lat) <-> point(t1.lon, t1.lat)
LIMIT 1
) t1 ON abs(t.lon - t1.lon) < d.tol
AND abs(t.lat - t1.lat) < d.tol
UNION ALL
SELECT t.id, t.lon, t.lat, t.tol
, t.step + 1, neighbors || t1.id
FROM cte t
JOIN LATERAL (
SELECT *
FROM tbl t1
WHERE t1.id <> t.id
ORDER BY point(t.lon, t.lat) <-> point(t1.lon, t1.lat) -- !
OFFSET t.step -- !
LIMIT 1 -- !
) t1 ON abs(t.lon - t1.lon) < t.tol -- !
AND abs(t.lat - t1.lat) < t.tol -- !
)
SELECT DISTINCT ON (id) id, lon, lat, neighbors
FROM cte
ORDER BY id, step DESC;
id | lon | lat | neighbours |
---|---|---|---|
1 | 10.111 | 20.415 | {3} |
3 | 10.110 | 20.414 | {1} |
db<>fiddle here
The first (non-recursive) CTE d
is just for convenience, to provide the tolerance once. If you wrap the query in a function or work with a prepared statement of concatenate it dynamically, you can drop it and pass the value in multiple spots directly.
The next CTE cte
is the rCTE. The non-recursive term only picks the nearest neighbour for each point (LIMIT 1
). If it's within tolerance, keep looking, else we're done - eliminating all rows without neighbours immediately.
In the recursive term, increase the OFFSET
with every step
to inspect the next neighbour, etc.
In the outer SELECT
we only take the last step per id
with DISTINCT ON
. See:
The remaining corner case error is this: you check for abs(lon_i - lon_j) < tol AND abs(lat_i-lat_j) < tol
, and that's different from the geometric distance by up to factor sqrt(2)
. So the query might stop at a neighbour outside tolerance, while the next neighbour further away should still be included. Bounding box vs. bounding circle.
Check for actual geometric distance instead. Typically makes more sense to begin with:
WITH RECURSIVE d(tol) AS (SELECT float8 '0.001415') -- input tolerance here
, cte AS (
SELECT t.id, t.lon, t.lat, d.tol
, 1 AS step, ARRAY [t1.id] AS neighbors
FROM tbl t
CROSS JOIN d
JOIN LATERAL (
SELECT *, point(t.lon, t.lat) <-> point(t1.lon, t1.lat) AS dist
FROM tbl t1
WHERE t1.id <> t.id
ORDER BY dist
LIMIT 1
) t1 ON dist < d.tol
UNION ALL
SELECT t.id, t.lon, t.lat, t.tol
, t.step + 1, neighbors || t1.id
FROM cte t
JOIN LATERAL (
SELECT *, point(t.lon, t.lat) <-> point(t1.lon, t1.lat) AS dist
FROM tbl t1
WHERE t1.id <> t.id
ORDER BY dist -- !
OFFSET t.step -- !
LIMIT 1 -- !
) t1 ON dist < t.tol -- !
)
SELECT DISTINCT ON (id) id, lon, lat, neighbors
FROM cte
ORDER BY id, step DESC;
Same result.
db<>fiddle here
point
to begin withImproved table defintion:
CREATE TABLE tbl (id int PRIMARY KEY, lonlat point); -- !!
INSERT INTO tbl VALUES
(1, '10.111, 20.415')
, (2, '10.099, 30.132')
, (3, '10.110, 20.414')
;
A point
consists of two float8 quantities and occupies 16 bytes on disk.
Now index the column directly (cheaper):
CREATE INDEX tbl_lonlat_gist_idx ON tbl USING GiST (lonlat);
The query gets simpler and a bit faster, yet:
WITH RECURSIVE d(tol) AS (SELECT float8 '0.001415') -- input tolerance here
, cte AS (
SELECT t.id, t.lonlat, d.tol, 1 AS step, ARRAY[t1.id] AS neighbours
FROM tbl t
CROSS JOIN d
JOIN LATERAL (
SELECT *, t.lonlat <-> t1.lonlat AS dist
FROM tbl t1
WHERE t1.id <> t.id
ORDER BY dist
LIMIT 1
) t1 ON dist < d.tol
UNION ALL
SELECT t.id, t.lonlat, t.tol, t.step + 1, neighbours || t1.id
FROM cte t
JOIN LATERAL (
SELECT *, t.lonlat <-> t1.lonlat AS dist
FROM tbl t1
WHERE t1.id <> t.id
ORDER BY dist -- !
OFFSET t.step -- !
LIMIT 1 -- !
) t1 ON dist < t.tol -- !
)
SELECT DISTINCT ON (id) id, lonlat, neighbours
FROM cte
ORDER BY id, step DESC;
id | lonlat | neighbours |
---|---|---|
1 | (10.111,20.415) | {3} |
3 | (10.11,20.414) | {1} |
db<>fiddle here
Related:
Upvotes: 2
Reputation: 164099
There is no need for a temporary table.
You can do a self join:
SELECT t1.id, t2.id
FROM tablename t1 INNER JOIN tablename t2
ON t2.id <> t1.id AND ABS(t1.lon - t2.lon) < :tol AND ABS(t1.lat - t2.lat) < :tol;
This will return 1 row for each pair of id
s that satisfies the conditions.
If you want for each id
a comma separated list of all the id
s that satisfy the conditions then you can aggregate and use STRING_AGG()
:
SELECT t1.id, STRING_AGG(t2.id, ';' ORDER BY t2.id)
FROM tablename t1 INNER JOIN tablename t2
ON t2.id <> t1.id AND ABS(t1.lon - t2.lon) < :tol AND ABS(t1.lat - t2.lat) < :tol
GROUP BY t1.id;
Upvotes: 2