gbox
gbox

Reputation: 829

Find rows with a column value close to the one of the current row

I have a table in the following format:

id lon lat
1 10.111 20.415
2 10.099 30.132
3 10.110 20.414

I would like to create a new column that returns all IDs such that lon and lat are less then a tolerance value away, i.e. abs(lon_i - lon_j) < tol AND abs(lat_i-lat_j) < tol for all i != j.

My initial thought going about it is to create a temporary duplicate of the table and then join the table giving the names of the columns lon_2, lat_2, and then:

SELECT id 
FROM TABLE_1
WHERE ABS(TABLE_1.lon - TABLE_1.lon_2) < tol 
  AND ABS(TABLE_1.lat - TABLE_1.lat_2) < tol

But there must be a better and more efficient way to approach it.

Upvotes: 2

Views: 688

Answers (2)

Erwin Brandstetter
Erwin Brandstetter

Reputation: 656804

all IDs such that lon and lat are less then a tolerance value away

A simple self-join as demonstrated by forpas can solve the task.

However, the simple query has to compute deltas for every combination of two rows before eliminating pairs outside the given tolerance. Basically a Cartesian Product, which scales terribly with O(N²). If your table isn't trivially small, keep reading.

The target is to get O(N) instead. We're going to use some advanced concepts:

  • nearest neighbour search
  • GiST expression index
  • LATERAL join
  • recursive CTE

Nearest neighbour query

If you are unfamiliar, here is an introduction to Nearest-Neighbour Searching with PostGis (which may be in use for your geocodes?).

Either way, we can do the same with the type point in stock Postgres. We need a GiST index, while sticking to your original original table, based on an expression:

CREATE INDEX tbl_point_gist_idx ON tbl USING GiST (point(lon, lat));

Basic query - fast but imperfect

SELECT t.*, t1.id AS near_id, t1.lon AS near_lon, t1.lat AS near_lat
FROM   tbl t
JOIN   LATERAL (
   SELECT *
   FROM   tbl t1
   WHERE  t1.id <> t.id
   ORDER  BY point(t.lon, t.lat) <-> point(t1.lon, t1.lat)
   LIMIT  5  -- arbitrary
   ) t1 ON abs(t.lon - t1.lon) < 0.0011
       AND abs(t.lat - t1.lat) < 0.0011;
id lon lat near_id near_lon near_lat
1 10.111 20.415 3 10.110 20.414
3 10.110 20.414 1 10.111 20.415

db<>fiddle here

Note the LATERAL join: another advanced concept. See:

Using the distance operator <-> to compute geometric distance between points.

A simple correlated subquery wouldn't do. We need to get the nearest neighbours before filtering for distance. This can use the GiST index very efficiently. Else, if a row has no qualifying neighbours, Postgres would have to search the whole table before giving up - which defies the purpose.

Problem: We are forced to use an arbitrary LIMIT, 5 in the example. Additional neighbours within tolerance will be snubbed. We'd need to use the maximum number of possible neighbours, which is yet unknown. With few big outliers it would also be inefficient to overshoot for the rest.

Also, the basic query returns one row per neighbour. We want to aggregate to a single row with all neighbours.

Fast and almost correct

Marry the above with another advanced concept: a recursive CTE:

WITH RECURSIVE d (tol) AS (SELECT float8 '0.001')  -- input tolerance here
, cte AS (
   SELECT t.id, t.lon, t.lat, d.tol
        , 1 AS step, ARRAY [t1.id] AS neighbors
   FROM   tbl t
   CROSS  JOIN d
   JOIN   LATERAL (
      SELECT *
      FROM   tbl t1
      WHERE  t1.id <> t.id
      ORDER  BY point(t.lon, t.lat) <-> point(t1.lon, t1.lat)
      LIMIT  1
      ) t1 ON abs(t.lon - t1.lon) < d.tol
          AND abs(t.lat - t1.lat) < d.tol

   UNION ALL
   SELECT t.id, t.lon, t.lat, t.tol
        , t.step + 1, neighbors || t1.id
   FROM   cte t
   JOIN   LATERAL (
      SELECT *
      FROM   tbl t1
      WHERE  t1.id <> t.id
      ORDER  BY point(t.lon, t.lat) <-> point(t1.lon, t1.lat)  -- !
      OFFSET t.step                                            -- !
      LIMIT  1                                                 -- !
      ) t1 ON abs(t.lon - t1.lon) < t.tol                      -- !
          AND abs(t.lat - t1.lat) < t.tol                      -- !
   )
SELECT DISTINCT ON (id) id, lon, lat, neighbors
FROM   cte
ORDER  BY id, step DESC;
id lon lat neighbours
1 10.111 20.415 {3}
3 10.110 20.414 {1}

db<>fiddle here

The first (non-recursive) CTE d is just for convenience, to provide the tolerance once. If you wrap the query in a function or work with a prepared statement of concatenate it dynamically, you can drop it and pass the value in multiple spots directly.

The next CTE cte is the rCTE. The non-recursive term only picks the nearest neighbour for each point (LIMIT 1). If it's within tolerance, keep looking, else we're done - eliminating all rows without neighbours immediately.
In the recursive term, increase the OFFSET with every step to inspect the next neighbour, etc.

In the outer SELECT we only take the last step per id with DISTINCT ON. See:

The remaining corner case error is this: you check for abs(lon_i - lon_j) < tol AND abs(lat_i-lat_j) < tol, and that's different from the geometric distance by up to factor sqrt(2). So the query might stop at a neighbour outside tolerance, while the next neighbour further away should still be included. Bounding box vs. bounding circle.

Faster and correct

Check for actual geometric distance instead. Typically makes more sense to begin with:

WITH RECURSIVE d(tol) AS (SELECT float8 '0.001415')  --  input tolerance here
, cte AS (
   SELECT t.id, t.lon, t.lat, d.tol
        , 1 AS step, ARRAY [t1.id] AS neighbors
   FROM   tbl t
   CROSS  JOIN d
   JOIN   LATERAL (
      SELECT *, point(t.lon, t.lat) <-> point(t1.lon, t1.lat) AS dist
      FROM   tbl t1
      WHERE  t1.id <> t.id
      ORDER  BY dist
      LIMIT  1
      ) t1 ON dist < d.tol

   UNION ALL
   SELECT t.id, t.lon, t.lat, t.tol
        , t.step + 1, neighbors || t1.id
   FROM   cte t
   JOIN   LATERAL (
      SELECT *, point(t.lon, t.lat) <-> point(t1.lon, t1.lat) AS dist
      FROM   tbl t1
      WHERE  t1.id <> t.id
      ORDER  BY dist         -- !
      OFFSET t.step          -- !
      LIMIT  1               -- !
      ) t1 ON dist < t.tol   -- !
   )
SELECT DISTINCT ON (id) id, lon, lat, neighbors
FROM   cte
ORDER  BY id, step DESC;

Same result.

db<>fiddle here

Better yet: work with point to begin with

Improved table defintion:

CREATE TABLE tbl (id int PRIMARY KEY, lonlat point);  -- !!
INSERT INTO tbl VALUES
  (1,  '10.111, 20.415')
, (2,  '10.099, 30.132')
, (3,  '10.110, 20.414')
;

A point consists of two float8 quantities and occupies 16 bytes on disk.

Now index the column directly (cheaper):

CREATE INDEX tbl_lonlat_gist_idx ON tbl USING GiST (lonlat);

The query gets simpler and a bit faster, yet:

WITH RECURSIVE d(tol) AS (SELECT float8 '0.001415')  --  input tolerance here
, cte AS (
   SELECT t.id, t.lonlat, d.tol, 1 AS step, ARRAY[t1.id] AS neighbours
   FROM   tbl t
   CROSS  JOIN d
   JOIN   LATERAL (
      SELECT *, t.lonlat <-> t1.lonlat AS dist
      FROM   tbl t1
      WHERE  t1.id <> t.id
      ORDER  BY dist
      LIMIT  1
      ) t1 ON dist < d.tol

   UNION ALL
   SELECT t.id, t.lonlat, t.tol, t.step + 1, neighbours || t1.id
   FROM   cte t
   JOIN   LATERAL (
      SELECT *, t.lonlat <-> t1.lonlat AS dist
      FROM   tbl t1
      WHERE  t1.id <> t.id
      ORDER  BY dist         -- !
      OFFSET t.step          -- !
      LIMIT  1               -- !
      ) t1 ON dist < t.tol   -- !
   )
SELECT DISTINCT ON (id) id, lonlat, neighbours
FROM   cte
ORDER  BY id, step DESC;
id lonlat neighbours
1 (10.111,20.415) {3}
3 (10.11,20.414) {1}

db<>fiddle here

Related:

Upvotes: 2

forpas
forpas

Reputation: 164099

There is no need for a temporary table.

You can do a self join:

SELECT t1.id, t2.id
FROM tablename t1 INNER JOIN tablename t2
ON t2.id <> t1.id AND ABS(t1.lon - t2.lon) < :tol AND ABS(t1.lat - t2.lat) < :tol;

This will return 1 row for each pair of ids that satisfies the conditions.

If you want for each id a comma separated list of all the ids that satisfy the conditions then you can aggregate and use STRING_AGG():

SELECT t1.id, STRING_AGG(t2.id, ';' ORDER BY t2.id)
FROM tablename t1 INNER JOIN tablename t2
ON t2.id <> t1.id AND ABS(t1.lon - t2.lon) < :tol AND ABS(t1.lat - t2.lat) < :tol
GROUP BY t1.id;

Upvotes: 2

Related Questions