SQL Gaps and Islands with multiple overlapping criteria

Question

My (anonymized here) data has several completely separate sidewalks with various lines of chalk in different colors along each sidewalk. I need to find the start and end of each section of drawing with unique combinations of chalk colors. So far, that seems like a fairly straightforward gaps-and-islands problem (excellent Q&A here and explanation here). My issue is that I'd like each "island" to consolidate the longest possible contiguous section of equivalently overlapped values, but instead, it's either giving me every little overlapping piece separately or smudging all the islands for each color together, depending on how I try to partition or sort the row_number()s.

Initial data:

create table #chalk (
    sidewalkId int,
    "start" int,
    "end" int,
    color varchar(50)
);
insert into #chalk values
(1, 0, 5, 'blue'),
(1, 5, 10, 'blue'),
(1, 10, 15, 'blue'),
--(1, 15, 20, null),--nulls may be explicit or implicit
(1, 20, 25, 'blue'),
--(1, 25, 30, null),--nulls may be explicit or implicit
(1, 30, 35, 'blue'),
(1, 35, 40, 'blue'),
(1, 0, 5, 'red'),
(1, 5, 10, 'red'),
(1, 10, 15, 'red'),
(1, 30, 35, 'red')

Desired results:

sidewalkId	color	start	end
1	'blue'	0	15
1	'red'	0	15
1	null	15	20
1	'blue'	20	25
1	null	25	30
1	'blue'	30	35
1	'red'	30	35
1	'blue'	35	40

Bad query 1:

    with cte as (
        select sidewalkId, color
            , row_number() over (partition by sidewalkId order by sidewalkId, "start") rna
            , row_number() over (partition by sidewalkId, color order by sidewalkId, "start") rnc
            , start, "end"
        from #chalk
    )
    select sidewalkId, color
        , min("start") "start", max("end") "end"
        , min(rna) rna_start, max(rna) rna_end
    from cte
    group by sidewalkId, color, rna-rnc
    order by sidewalkId, min(rna)

Bad results 1:

sidewalkId	color	start	end	rna_start	rna_end
1	blue	0	5	1	1
1	red	0	10	2	3
1	blue	5	15	4	5
1	red	10	15	6	6
1	NULL	15	20	7	7
1	blue	20	25	8	8
1	NULL	25	30	9	9
1	red	25	35	10	11
1	blue	30	40	12	13

Bad query 2:

    with cte as (
        select sidewalkId, color
            , row_number() over (partition by sidewalkId order by sidewalkId, color, "start") rna
            , row_number() over (partition by sidewalkId, color order by sidewalkId, "start") rnc
            , "start", "end"
        from #chalk
    )
    select sidewalkId, color
        , min("start"), max("end")
        , min(rna), max(rna)
    from cte
    group by sidewalkId, color, rna-rnc
    order by sidewalkId, min(rna)

Bad results 2:

sidewalkId	color	start	end	rna_start	rna_end
1	NULL	15	30	1	2
1	blue	0	40	3	8
1	red	0	35	9	13

I've tried using rank() or dense_rank() instead of row_number() (good comparison of those functions here) to get the desired results, but those don't seem to work for my requirements, either.

Guillaume Outters · Accepted Answer

The islands just have to be computed on the groups of sidewalking colors, not on the isolated colors themselves:
[0;5] is walked through by the blue,red band, then [5;10] too, and so on, [20;25] by the 1-member band blue, etc.

By first determining which color segments walk together,
and computing cte on it instead of #chalk,
you'll be able to take your first query nearly as is (see (1)):

with
  band as
  (
    select
      sidewalkId, start, "end",
      string_agg(color, ',') within group (order by color) name
    from #chalk
    group by sidewalkId, start, "end"
  ),
  cte as (
        select sidewalkId, name
            , row_number() over (partition by sidewalkId order by sidewalkId, start) rna
            , row_number() over (partition by sidewalkId, name order by sidewalkId, start) rnc
            , start, "end"
        from band
    )
  select cte.sidewalkId, color
      , min(cte.start) start, max(cte."end") "end"
      , min(rna) rna_start, max(rna) rna_end, cte.name bandname
  from cte join #chalk o on o.start = cte.start and o."end" = cte."end"
  group by cte.sidewalkId, name, color, rna-rnc
  order by cte.sidewalkId, min(rna)

(the fiddle includes additional columns with the "band names" such as 'blue,red'; another fiddle adds the splitting that @jed-schaaf you already did elsewhere to produce #chalk)

(1) In fact, after fully qualifying fields in your query (start → cte.start), the diff with my query is only:

 with
+  band as (…)
   cte as
   (
      […]
-     from #chalk
+     from band
   )
-  select cte.sidewalkId, cte.color
+  select cte.sidewalkId, o.color
   […]
   from cte
+  join #chalk o on o.start = cte.start and o."end" = cte."end"
-  group by cte.sidewalkId, cte.color, rna-rnc
+  group by cte.sidewalkId, o.color, rna-rnc

SQL Gaps and Islands with multiple overlapping criteria

Answers (2)

Related Questions

sidewalkid	color	pstart	pend	colors	grn
1	blue	0	15	blue-red	0
1	red	0	15	blue-red	0
1	null	15	20	null	1
1	blue	20	25	blue	2
1	null	25	30	null	3
1	blue	30	35	blue-red	4
1	red	30	35	blue-red	4
1	blue	35	40	blue	5
2	blue	0	10	blue	0
2	green	10	20	green-red	1
2	red	10	20	green-red	1
2	green	20	25	green-orange	2
2	orange	20	25	green-orange	2
2	yellow	25	35	yellow	3