Jed Schaaf
Jed Schaaf

Reputation: 1127

SQL Gaps and Islands with multiple overlapping criteria

My (anonymized here) data has several completely separate sidewalks with various lines of chalk in different colors along each sidewalk. I need to find the start and end of each section of drawing with unique combinations of chalk colors. So far, that seems like a fairly straightforward gaps-and-islands problem (excellent Q&A here and explanation here). My issue is that I'd like each "island" to consolidate the longest possible contiguous section of equivalently overlapped values, but instead, it's either giving me every little overlapping piece separately or smudging all the islands for each color together, depending on how I try to partition or sort the row_number()s.

Initial data:

create table #chalk (
    sidewalkId int,
    "start" int,
    "end" int,
    color varchar(50)
);
insert into #chalk values
(1, 0, 5, 'blue'),
(1, 5, 10, 'blue'),
(1, 10, 15, 'blue'),
--(1, 15, 20, null),--nulls may be explicit or implicit
(1, 20, 25, 'blue'),
--(1, 25, 30, null),--nulls may be explicit or implicit
(1, 30, 35, 'blue'),
(1, 35, 40, 'blue'),
(1, 0, 5, 'red'),
(1, 5, 10, 'red'),
(1, 10, 15, 'red'),
(1, 30, 35, 'red')

Desired results:

sidewalkId color start end
1 'blue' 0 15
1 'red' 0 15
1 null 15 20
1 'blue' 20 25
1 null 25 30
1 'blue' 30 35
1 'red' 30 35
1 'blue' 35 40

Bad query 1:

    with cte as (
        select sidewalkId, color
            , row_number() over (partition by sidewalkId order by sidewalkId, "start") rna
            , row_number() over (partition by sidewalkId, color order by sidewalkId, "start") rnc
            , start, "end"
        from #chalk
    )
    select sidewalkId, color
        , min("start") "start", max("end") "end"
        , min(rna) rna_start, max(rna) rna_end
    from cte
    group by sidewalkId, color, rna-rnc
    order by sidewalkId, min(rna)

Bad results 1:

sidewalkId color start end rna_start rna_end
1 blue 0 5 1 1
1 red 0 10 2 3
1 blue 5 15 4 5
1 red 10 15 6 6
1 NULL 15 20 7 7
1 blue 20 25 8 8
1 NULL 25 30 9 9
1 red 25 35 10 11
1 blue 30 40 12 13

Bad query 2:

    with cte as (
        select sidewalkId, color
            , row_number() over (partition by sidewalkId order by sidewalkId, color, "start") rna
            , row_number() over (partition by sidewalkId, color order by sidewalkId, "start") rnc
            , "start", "end"
        from #chalk
    )
    select sidewalkId, color
        , min("start"), max("end")
        , min(rna), max(rna)
    from cte
    group by sidewalkId, color, rna-rnc
    order by sidewalkId, min(rna)

Bad results 2:

sidewalkId color start end rna_start rna_end
1 NULL 15 30 1 2
1 blue 0 40 3 8
1 red 0 35 9 13

I've tried using rank() or dense_rank() instead of row_number() (good comparison of those functions here) to get the desired results, but those don't seem to work for my requirements, either.

Upvotes: 4

Views: 115

Answers (2)

ValNik
ValNik

Reputation: 5916

  1. First, each lines of chalk should be divided into the most minimal parts.
    For example (hand made))
 (1,  0, 10, 'blue'),
 (1,  5, 15, 'red'),
 (1, 15, 20, 'red'),
 (1, 25, 35, 'green'),
 (1, 25, 35, 'yellow')

We take points (0,5,10,15,20,25,35). (subquery u)
And for every part JOIN orginal lines (CTE ranges)

( 0, 5)->(1,0,5,'blue')
( 5,10)->(1,5,10,'blue')
       ->(1,5,10,'red')
(10,15)->(1,10,15,'red')
(15,20)->(1,15,20,'red')
(20,25)->(1,20,25,null)
(25,35)->(1,25,35,'green')
       ->(1,25,35,'yellow')

This is already close to what you want to get, but you need to combine the chains where possible - to get longest lines.

( 0, 5)->(1,0,5,'blue')
( 5,10)->(1,5,10,'blue')
       ->(1,5,10,'red')
(10,20)->(1,10,20,'red')  ( (1,10,15,'red')+(1,15,20,'red') )
(20,25)->(1,20,25,null)
(25,35)->(1,25,35,'green')
       ->(1,25,35,'yellow')

  1. Consolidate back This will be done by 2 steps.
    2.1.Consolidate rows
    2.2.Find gaps
   source                consolidated          isGap   groupN
(1,0,5,'blue')         (1,0,5,'blue')           0        0
(1,5,10,'blue')        (1,5,10,'blue,red')      1        1
(1,5,10,'red')
(1,10,15,'red')        (1,10,15,'red')          1        2
(1,15,20,'red')        (1,15,20,'red')          0        2 
(1,20,25,null)         (1,20,25,null)           1        3
(1,25,35,'green')      (1,25,35,'green,yellow') 1        4
(1,25,35,'yellow')

Aggregating rows, we get

(1,0,5,'blue')                   0
(1,5,10,'blue,red')              1
(1,10,20,'red')                  2
(1,20,25,null)                   3
(1,25,35,'green,yellow')         4

Last step - expand rows, where several colors

(1,0,5,'blue')        0
(1,5,10,'blue')       1
(1,5,10,'red')        1
(1,10,20,'red')       2
(1,20,25,null)        3
(1,25,35,'green')     4
(1,25,35,'yellow')    4

See example
Test data (expanded)

sidewalkId pstart pend color
1 0 5 blue
1 5 10 blue
1 10 15 blue
1 20 25 blue
1 30 35 blue
1 35 40 blue
1 0 5 red
1 5 10 red
1 10 15 red
1 30 35 red
2 0 10 blue
2 10 20 red
2 10 25 green
2 20 25 orange
2 25 35 yellow
with ranges as( -- devide all section's to detailed parts
select p.sidewalkid,p.point,p.next,c.pstart,c.pend,c.color
from(
  select * ,lead(point)over(partition by sidewalkid order by point) next
  from( -- all points in sidewalk's
     select sidewalkid,pstart point from #chalk
     union 
     select sidewalkid,pend point from #chalk
   )u
)p     -- intersection detailed sections and walk parts
left join #chalk c on p.sidewalkid=c.sidewalkid
   and( 
     c.pstart<=p.point and c.pend>p.point
     or 
     c.pstart<p.next and c.pend>=p.next
     )
where next is not null
)
,detailed_path as( -- number of group (island)
select * ,sum(newGr)over(partition by sidewalkid order by point)grN
from(  -- gaps
select * 
  ,case when colors=lag(colors,1,colors)over(partition by sidewalkid order by point) 
     then 0
   else 1 
   end newGr
from( -- consolidate equal parts of all colors
select sidewalkid,point,next
  ,string_agg(color,'-') within group(order by color) colors
from ranges
group by sidewalkid,point,next
)a
)b
)
,consolidated_path as(
select sidewalkid,min(point)pstart,max(next)pend
  ,min(colors)colors,grn
from detailed_path
group by sidewalkid,grn
)
    -- expand back to every color in section
select sidewalkid,nullif(value,'') color, pstart, pend,colors,  grn  
from consolidated_path
cross apply string_split(coalesce(colors,''),'-')cl
order by sidewalkid,grn;
;
sidewalkid color pstart pend colors grn
1 blue 0 15 blue-red 0
1 red 0 15 blue-red 0
1 null 15 20 null 1
1 blue 20 25 blue 2
1 null 25 30 null 3
1 blue 30 35 blue-red 4
1 red 30 35 blue-red 4
1 blue 35 40 blue 5
2 blue 0 10 blue 0
2 green 10 20 green-red 1
2 red 10 20 green-red 1
2 green 20 25 green-orange 2
2 orange 20 25 green-orange 2
2 yellow 25 35 yellow 3

CTE consolidated_path output

sidewalkid pstart pend colors grn
1 0 15 blue-red 0
1 15 20 null 1
1 20 25 blue 2
1 25 30 null 3
1 30 35 blue-red 4
1 35 40 blue 5
2 0 10 blue 0
2 10 20 green-red 1
2 20 25 green-orange 2
2 25 35 yellow 3

Detailed_path

sidewalkid point next colors newGr grN
1 0 5 blue-red 0 0
1 5 10 blue-red 0 0
1 10 15 blue-red 0 0
1 15 20 null 1 1
1 20 25 blue 1 2
1 25 30 null 1 3
1 30 35 blue-red 1 4
1 35 40 blue 1 5
2 0 10 blue 0 0
2 10 20 green-red 1 1
2 20 25 green-orange 1 2
2 25 35 yellow 1 3

fiddle

Upvotes: 2

Guillaume Outters
Guillaume Outters

Reputation: 1607

The islands just have to be computed on the groups of sidewalking colors, not on the isolated colors themselves:
[0;5] is walked through by the blue,red band, then [5;10] too, and so on, [20;25] by the 1-member band blue, etc.

By first determining which color segments walk together,
and computing cte on it instead of #chalk,
you'll be able to take your first query nearly as is (see (1)):

with
  band as
  (
    select
      sidewalkId, start, "end",
      string_agg(color, ',') within group (order by color) name
    from #chalk
    group by sidewalkId, start, "end"
  ),
  cte as (
        select sidewalkId, name
            , row_number() over (partition by sidewalkId order by sidewalkId, start) rna
            , row_number() over (partition by sidewalkId, name order by sidewalkId, start) rnc
            , start, "end"
        from band
    )
  select cte.sidewalkId, color
      , min(cte.start) start, max(cte."end") "end"
      , min(rna) rna_start, max(rna) rna_end, cte.name bandname
  from cte join #chalk o on o.start = cte.start and o."end" = cte."end"
  group by cte.sidewalkId, name, color, rna-rnc
  order by cte.sidewalkId, min(rna)

(the fiddle includes additional columns with the "band names" such as 'blue,red'; another fiddle adds the splitting that @jed-schaaf you already did elsewhere to produce #chalk)

(1) In fact, after fully qualifying fields in your query (startcte.start), the diff with my query is only:

 with
+  band as (…)
   cte as
   (
      […]
-     from #chalk
+     from band
   )
-  select cte.sidewalkId, cte.color
+  select cte.sidewalkId, o.color
   […]
   from cte
+  join #chalk o on o.start = cte.start and o."end" = cte."end"
-  group by cte.sidewalkId, cte.color, rna-rnc
+  group by cte.sidewalkId, o.color, rna-rnc

Upvotes: 2

Related Questions