Reputation: 1127
My (anonymized here) data has several completely separate sidewalks with various lines of chalk in different colors along each sidewalk. I need to find the start and end of each section of drawing with unique combinations of chalk colors. So far, that seems like a fairly straightforward gaps-and-islands problem (excellent Q&A here and explanation here). My issue is that I'd like each "island" to consolidate the longest possible contiguous section of equivalently overlapped values, but instead, it's either giving me every little overlapping piece separately or smudging all the islands for each color together, depending on how I try to partition or sort the row_number()
s.
Initial data:
create table #chalk (
sidewalkId int,
"start" int,
"end" int,
color varchar(50)
);
insert into #chalk values
(1, 0, 5, 'blue'),
(1, 5, 10, 'blue'),
(1, 10, 15, 'blue'),
--(1, 15, 20, null),--nulls may be explicit or implicit
(1, 20, 25, 'blue'),
--(1, 25, 30, null),--nulls may be explicit or implicit
(1, 30, 35, 'blue'),
(1, 35, 40, 'blue'),
(1, 0, 5, 'red'),
(1, 5, 10, 'red'),
(1, 10, 15, 'red'),
(1, 30, 35, 'red')
Desired results:
sidewalkId | color | start | end |
---|---|---|---|
1 | 'blue' | 0 | 15 |
1 | 'red' | 0 | 15 |
1 | null | 15 | 20 |
1 | 'blue' | 20 | 25 |
1 | null | 25 | 30 |
1 | 'blue' | 30 | 35 |
1 | 'red' | 30 | 35 |
1 | 'blue' | 35 | 40 |
Bad query 1:
with cte as (
select sidewalkId, color
, row_number() over (partition by sidewalkId order by sidewalkId, "start") rna
, row_number() over (partition by sidewalkId, color order by sidewalkId, "start") rnc
, start, "end"
from #chalk
)
select sidewalkId, color
, min("start") "start", max("end") "end"
, min(rna) rna_start, max(rna) rna_end
from cte
group by sidewalkId, color, rna-rnc
order by sidewalkId, min(rna)
Bad results 1:
sidewalkId | color | start | end | rna_start | rna_end |
---|---|---|---|---|---|
1 | blue | 0 | 5 | 1 | 1 |
1 | red | 0 | 10 | 2 | 3 |
1 | blue | 5 | 15 | 4 | 5 |
1 | red | 10 | 15 | 6 | 6 |
1 | NULL | 15 | 20 | 7 | 7 |
1 | blue | 20 | 25 | 8 | 8 |
1 | NULL | 25 | 30 | 9 | 9 |
1 | red | 25 | 35 | 10 | 11 |
1 | blue | 30 | 40 | 12 | 13 |
Bad query 2:
with cte as (
select sidewalkId, color
, row_number() over (partition by sidewalkId order by sidewalkId, color, "start") rna
, row_number() over (partition by sidewalkId, color order by sidewalkId, "start") rnc
, "start", "end"
from #chalk
)
select sidewalkId, color
, min("start"), max("end")
, min(rna), max(rna)
from cte
group by sidewalkId, color, rna-rnc
order by sidewalkId, min(rna)
Bad results 2:
sidewalkId | color | start | end | rna_start | rna_end |
---|---|---|---|---|---|
1 | NULL | 15 | 30 | 1 | 2 |
1 | blue | 0 | 40 | 3 | 8 |
1 | red | 0 | 35 | 9 | 13 |
I've tried using rank()
or dense_rank()
instead of row_number()
(good comparison of those functions here) to get the desired results, but those don't seem to work for my requirements, either.
Upvotes: 4
Views: 115
Reputation: 5916
lines of chalk
should be divided into the most minimal parts. (1, 0, 10, 'blue'),
(1, 5, 15, 'red'),
(1, 15, 20, 'red'),
(1, 25, 35, 'green'),
(1, 25, 35, 'yellow')
We take points (0,5,10,15,20,25,35). (subquery u
)
And for every part JOIN orginal lines (CTE ranges
)
( 0, 5)->(1,0,5,'blue')
( 5,10)->(1,5,10,'blue')
->(1,5,10,'red')
(10,15)->(1,10,15,'red')
(15,20)->(1,15,20,'red')
(20,25)->(1,20,25,null)
(25,35)->(1,25,35,'green')
->(1,25,35,'yellow')
This is already close to what you want to get, but you need to combine the chains where possible - to get longest lines.
( 0, 5)->(1,0,5,'blue')
( 5,10)->(1,5,10,'blue')
->(1,5,10,'red')
(10,20)->(1,10,20,'red') ( (1,10,15,'red')+(1,15,20,'red') )
(20,25)->(1,20,25,null)
(25,35)->(1,25,35,'green')
->(1,25,35,'yellow')
source consolidated isGap groupN
(1,0,5,'blue') (1,0,5,'blue') 0 0
(1,5,10,'blue') (1,5,10,'blue,red') 1 1
(1,5,10,'red')
(1,10,15,'red') (1,10,15,'red') 1 2
(1,15,20,'red') (1,15,20,'red') 0 2
(1,20,25,null) (1,20,25,null) 1 3
(1,25,35,'green') (1,25,35,'green,yellow') 1 4
(1,25,35,'yellow')
Aggregating rows, we get
(1,0,5,'blue') 0
(1,5,10,'blue,red') 1
(1,10,20,'red') 2
(1,20,25,null) 3
(1,25,35,'green,yellow') 4
Last step - expand rows, where several colors
(1,0,5,'blue') 0
(1,5,10,'blue') 1
(1,5,10,'red') 1
(1,10,20,'red') 2
(1,20,25,null) 3
(1,25,35,'green') 4
(1,25,35,'yellow') 4
See example
Test data (expanded)
sidewalkId | pstart | pend | color |
---|---|---|---|
1 | 0 | 5 | blue |
1 | 5 | 10 | blue |
1 | 10 | 15 | blue |
1 | 20 | 25 | blue |
1 | 30 | 35 | blue |
1 | 35 | 40 | blue |
1 | 0 | 5 | red |
1 | 5 | 10 | red |
1 | 10 | 15 | red |
1 | 30 | 35 | red |
2 | 0 | 10 | blue |
2 | 10 | 20 | red |
2 | 10 | 25 | green |
2 | 20 | 25 | orange |
2 | 25 | 35 | yellow |
with ranges as( -- devide all section's to detailed parts
select p.sidewalkid,p.point,p.next,c.pstart,c.pend,c.color
from(
select * ,lead(point)over(partition by sidewalkid order by point) next
from( -- all points in sidewalk's
select sidewalkid,pstart point from #chalk
union
select sidewalkid,pend point from #chalk
)u
)p -- intersection detailed sections and walk parts
left join #chalk c on p.sidewalkid=c.sidewalkid
and(
c.pstart<=p.point and c.pend>p.point
or
c.pstart<p.next and c.pend>=p.next
)
where next is not null
)
,detailed_path as( -- number of group (island)
select * ,sum(newGr)over(partition by sidewalkid order by point)grN
from( -- gaps
select *
,case when colors=lag(colors,1,colors)over(partition by sidewalkid order by point)
then 0
else 1
end newGr
from( -- consolidate equal parts of all colors
select sidewalkid,point,next
,string_agg(color,'-') within group(order by color) colors
from ranges
group by sidewalkid,point,next
)a
)b
)
,consolidated_path as(
select sidewalkid,min(point)pstart,max(next)pend
,min(colors)colors,grn
from detailed_path
group by sidewalkid,grn
)
-- expand back to every color in section
select sidewalkid,nullif(value,'') color, pstart, pend,colors, grn
from consolidated_path
cross apply string_split(coalesce(colors,''),'-')cl
order by sidewalkid,grn;
;
sidewalkid | color | pstart | pend | colors | grn |
---|---|---|---|---|---|
1 | blue | 0 | 15 | blue-red | 0 |
1 | red | 0 | 15 | blue-red | 0 |
1 | null | 15 | 20 | null | 1 |
1 | blue | 20 | 25 | blue | 2 |
1 | null | 25 | 30 | null | 3 |
1 | blue | 30 | 35 | blue-red | 4 |
1 | red | 30 | 35 | blue-red | 4 |
1 | blue | 35 | 40 | blue | 5 |
2 | blue | 0 | 10 | blue | 0 |
2 | green | 10 | 20 | green-red | 1 |
2 | red | 10 | 20 | green-red | 1 |
2 | green | 20 | 25 | green-orange | 2 |
2 | orange | 20 | 25 | green-orange | 2 |
2 | yellow | 25 | 35 | yellow | 3 |
CTE consolidated_path
output
sidewalkid | pstart | pend | colors | grn |
---|---|---|---|---|
1 | 0 | 15 | blue-red | 0 |
1 | 15 | 20 | null | 1 |
1 | 20 | 25 | blue | 2 |
1 | 25 | 30 | null | 3 |
1 | 30 | 35 | blue-red | 4 |
1 | 35 | 40 | blue | 5 |
2 | 0 | 10 | blue | 0 |
2 | 10 | 20 | green-red | 1 |
2 | 20 | 25 | green-orange | 2 |
2 | 25 | 35 | yellow | 3 |
Detailed_path
sidewalkid | point | next | colors | newGr | grN |
---|---|---|---|---|---|
1 | 0 | 5 | blue-red | 0 | 0 |
1 | 5 | 10 | blue-red | 0 | 0 |
1 | 10 | 15 | blue-red | 0 | 0 |
1 | 15 | 20 | null | 1 | 1 |
1 | 20 | 25 | blue | 1 | 2 |
1 | 25 | 30 | null | 1 | 3 |
1 | 30 | 35 | blue-red | 1 | 4 |
1 | 35 | 40 | blue | 1 | 5 |
2 | 0 | 10 | blue | 0 | 0 |
2 | 10 | 20 | green-red | 1 | 1 |
2 | 20 | 25 | green-orange | 1 | 2 |
2 | 25 | 35 | yellow | 1 | 3 |
Upvotes: 2
Reputation: 1607
The islands just have to be computed on the groups of sidewalking colors, not on the isolated colors themselves:
[0;5]
is walked through by the blue,red
band, then [5;10]
too, and so on, [20;25]
by the 1-member band blue
, etc.
By first determining which color segments walk together,
and computing cte
on it instead of #chalk
,
you'll be able to take your first query nearly as is (see (1)):
with
band as
(
select
sidewalkId, start, "end",
string_agg(color, ',') within group (order by color) name
from #chalk
group by sidewalkId, start, "end"
),
cte as (
select sidewalkId, name
, row_number() over (partition by sidewalkId order by sidewalkId, start) rna
, row_number() over (partition by sidewalkId, name order by sidewalkId, start) rnc
, start, "end"
from band
)
select cte.sidewalkId, color
, min(cte.start) start, max(cte."end") "end"
, min(rna) rna_start, max(rna) rna_end, cte.name bandname
from cte join #chalk o on o.start = cte.start and o."end" = cte."end"
group by cte.sidewalkId, name, color, rna-rnc
order by cte.sidewalkId, min(rna)
(the fiddle includes additional columns with the "band names" such as 'blue,red';
another fiddle adds the splitting that @jed-schaaf you already did elsewhere to produce #chalk
)
(1) In fact, after fully qualifying fields in your query (start
→ cte.start
), the diff with my query is only:
with
+ band as (…)
cte as
(
[…]
- from #chalk
+ from band
)
- select cte.sidewalkId, cte.color
+ select cte.sidewalkId, o.color
[…]
from cte
+ join #chalk o on o.start = cte.start and o."end" = cte."end"
- group by cte.sidewalkId, cte.color, rna-rnc
+ group by cte.sidewalkId, o.color, rna-rnc
Upvotes: 2