Fast nearest neighbor matching in using int4range type in PostgreSQL

Question

I have two tables with ranges t1 and t2 and am trying to find the closest match in t2 for every entry in t1.

I am defining distance as follows:

-- distance function
CREATE OR REPLACE FUNCTION distance(
    pos1 int4range,
    pos2 int4range)
    RETURNS integer
    LANGUAGE 'sql'
AS $BODY$SELECT CASE
WHEN pos1 && pos2 THEN 0
WHEN pos1 >> pos2 THEN lower(pos1) - upper(pos2)
WHEN pos1 << pos2 THEN lower(pos2) - upper(pos1)
ELSE NULL
END AS distance$BODY$;

Let's define t1 and t2 as follows:

SELECT setseed(0.20191109);

DROP TABLE IF EXISTS t1;
DROP TABLE IF EXISTS t2;

WITH
t1n AS
(
    SELECT generate_series (1,500) AS id1, floor(random()*1e7)::integer AS n 
)
SELECT id1, int4range(n, n + ceil(random()*10)::integer) AS range1
INTO t1
FROM t1n;

WITH
t2n AS
(
    SELECT generate_series (1,10000) AS id2, floor(random()*1e7)::integer AS n 
)
SELECT id2, int4range(n, n + ceil(random()*100)::integer) AS range2
INTO t2
FROM t2n;

The two straightforward solutions that I can think using a join and a scalar subquery are both very inefficient since they (1) compute all possible distances and aggregate or (2) sort the entirety of t2 for every value in t1:

-- method one
SELECT id1,min(distance(range1,range2)) AS dist
FROM t1
CROSS JOIN t2
GROUP BY 1
ORDER BY 1;

-- method two
SELECT id1,(SELECT distance(t2.range2, t1.range1) FROM t2 ORDER BY distance(t2.range2, t1.range1) LIMIT 1) AS dist
FROM t1
ORDER BY 1;

Here is a rextester link: https://rextester.com/AIYEQ97536

I am wondering if there is a more efficient solution that I am missing?

Update: I've tried using GiST, SP-GiST, B-tree/GiST indexes on the range types but none of these seem to naively support using the <-> distance operator on int4range? At least not in PostgreSQL 10.5? Is there a way to use the distance function I defined at the top of this post with one of these index types?

jjanes · Accepted Answer

but none of these seem to naively support using the <-> distance operator on int4range?

Not only do they not support that operator, that operator doesn't even exist on int4range out of the box. You could create such an operator trivially in SQL:

create operator <-> ( function = distance, leftarg=int4range, rightarg=int4range);

but hooking it up to an index would require quite a lot of C programming.

Without the C programming, you could find the closest thing by taking the union of 3 queries, overlaps, closest left, and closes to the right, and then taking the min of those. The closest one to the right is easy, it must be the first one > the querent, so that should be easily indexable. That doesn't work for the closest to the left, however. You can get around that with a functional index. The combined result is not pretty, but it is generally fast (once the tables have been VACUUM ANALYZEd).

create index on t2 using gist (range2 );
create index on t2 (range2 );
create index on t2 (int4range(-upper(range2),-lower(range2),'[]'));


select id1, distance(range1,range2) AS dist from t1 cross join lateral 
(
    select * from (
       (select * from t2 where t2.range2 && t1.range1 limit 1) union all 
       (select * from t2 where t2.range2 > t1.range1 order by t2.range2 limit 1) union all 
       (select * from t2 where int4range(-upper(range2),-lower(range2),'[]') > int4range(-upper(range1),-lower(range1),'[]') order by int4range(-upper(range2),-lower(range2),'[]') limit 1)
    ) foobar order by distance(range1,range2) limit 1
) foo;

I'm not sure that '[]' is the correct treatment of the endpoints, but this query does give identical results to your method 1.

Fast nearest neighbor matching in using int4range type in PostgreSQL

Answers (1)

Related Questions