partitioning intervals by distance

Question

I've got a table in a Redshift database that contains intervals with lower bounds l and upper bounds u, like so:

| interval_id | l  | u  |
| ----------- | -- | -- |
| 1           | 1  | 10 |
| 2           | 2  | 5  |
| 3           | 5  | 15 |
| 4           | 26 | 30 |
| 5           | 28 | 35 |
| 6           | 30 | 31 |
| 7           | 44 | 45 |
| 8           | 56 | 58 |

I would like to partition these intervals such that the sets are at least 10 apart. The desired result for the example above would be

| interval_id | l  | u  | group |
| ----------- | -- | -- | ----- |
| 1           | 1  | 10 | A     |
| 2           | 2  | 5  | A     |
| 3           | 5  | 15 | A     |
| 4           | 26 | 30 | B     |
| 5           | 28 | 35 | B     |
| 6           | 30 | 31 | B     |
| 7           | 44 | 45 | B     |
| 8           | 56 | 58 | C     |

where e.g. the first three rows would belong to group a because these intervals either overlap or outright contain each other or are less than 10 apart and row 4 belongs to group b because its lower bound is 11 away from the upper bound of the union (well, technically the convex hull) of intervals 1-3.

Gordon Linoff · Accepted Answer

Hmmm. I think you can take the max of the upper bound for all preceding rows to see where a break occurs. Then, the rest is just a cumulative sum:

select u.*,
       sum(case when l < prev_max_u + 10 then 0 else 1 end) over (order by l) as groupid
from (select t.*,
             max(u) over (order by l rows between unbounded preceding and 1 preceding)  as prev_max_u
      from t
     ) u;

I would normally approach this using a correlated subquery, but I think this formulation will work. I might not work if you have two rows that are exact duplicates.

partitioning intervals by distance

Answers (1)

Related Questions