Novitskiy Denis
Novitskiy Denis

Reputation: 145

Find uninterruptible groups of dates

I have data like this

Start        End
2022-01-01   2022-01-31
2022-02-01   2022-02-28
2022-03-05   2022-04-05
2022-05-01   2022-05-31
2022-06-01   2022-06-30
2022-07-01   2022-07-15
2022-07-05   2022-07-16
2022-07-15   2022-08-01

I want to find all continuous periods (where there is no more than 1 day between the start and the end) An example of the desired result

2022-01-01   2022-02-28
2022-03-05   2022-04-05
2022-05-01   2022-08-01

Upvotes: 0

Views: 128

Answers (4)

Mikhail Berlyant
Mikhail Berlyant

Reputation: 172994

i`m add some new data to my example. Periods can cross each other

in this case use dense_rank instead of row_number

select min(day) as start, max(day) as `end` 
from (
  select day, unix_date(day) - dense_rank() over(order by day) as grp
  from your_table, unnest(generate_date_array(date(start), date(`end`))) day
)
group by grp             

with output

enter image description here

Upvotes: 0

Samuel
Samuel

Reputation: 3528

For overlapping date ranges, the best way is to create each day in these ranges and combinde them. Searching for date gaps is the next step. However, this generates a huge amount of rows, depending on the date range. See also the solution of @mikhail-berlyant

with tbl as 
(
  Select date("2022-01-01") as start,date("2022-01-31") as ending
union all Select date("2022-02-01"),date("2022-02-28")
union all Select date("2022-03-02"),date("2022-03-02")
union all Select date("2022-03-05"),date("2022-04-05")
union all Select date("2022-05-01"),date("2022-05-31")
union all Select date("2022-06-01"),date("2022-06-30")
union all Select date("2022-07-01"),date("2022-07-12")
union all Select date("2022-06-05"),date("2022-07-16")
union all Select date("2022-07-15"),date("2022-08-01")
),

tmp as (
  Select toDAY,
  date_diff(today,lag(today) over (order by today),day)-1 as diff
  from (Select DISTINCT toDAY from tbl, unnest(generate_date_array(start,ending)) as toDAY)
),
tmp2 as (
  Select toDay,
  sum(ifnull(diff,0)) over (order by today range between unbounded preceding and current row) as grp
  from tmp
)

select
 grp, min(today),max(today) from tmp2
group by 1

For non overlapping ranges, I suggested this solution:

There is the need to combine the value between rows. LEAD and LAG are these window functions. With these the comparision of the start date with the previous end date is possible. The same is done for the end date with the next start date. Thus we obain two columns A and B_ which have only an entrie for breaks. We need to filter all rows where A and B_ have both nulls, qualify helps here to prevent a further select statement. The A and B_ dates are now split often over two rows, therefore the end date in column B is put in the above row, if this one is null: ifnull(B_,lead(B_) over (order by start))

with tbl as 
(
  Select date("2022-01-01") as start,date("2022-01-31") as ending
union all Select date("2022-02-01"),date("2022-02-28")
union all Select date("2022-03-05"),date("2022-04-05")
union all Select date("2022-05-01"),date("2022-05-31")
union all Select date("2022-06-01"),date("2022-06-30")
union all Select date("2022-07-01"),date("2022-07-15")
)

Select A, ifnull(B_,lead(B_) over (order by start)) B
from
(
Select *,
if(ifnull(date_diff(start,lag(ending) over (order by Start),day),99)>1,start,null) A,
if(ifnull(date_diff(ending,lead(start) over (order by Start),day),-99)<-1,ending,null) B_
 from tbl
Qualify A is not null or B_ is not null
#order by Start
)
Qualify A is not null
order by A

Upvotes: 2

Jaytiger
Jaytiger

Reputation: 12234

Consider below approach.

WITH partitions AS (
  SELECT *, COUNTIF(flag) OVER (ORDER BY start) par FROM (
    SELECT *, DATE_DIFF(start, LAG(`end`) OVER (ORDER BY start), DAY) > 1 flag
      FROM sample_table
  )
)
SELECT MIN(start) start, MAX(`end`) `end`
  FROM partitions
 GROUP BY par;
Query results
+------------+------------+
|   start    |    end     |
+------------+------------+
| 2022-01-01 | 2022-02-28 |
| 2022-03-05 | 2022-04-05 |
| 2022-05-01 | 2022-08-01 |
+------------+------------+

Upvotes: 1

Mikhail Berlyant
Mikhail Berlyant

Reputation: 172994

Consider yet another approach

select min(day) as start, max(day) as `end` 
from (
  select day, unix_date(day) - row_number() over(order by day) as grp
  from your_table, unnest(generate_date_array(date(start), date(`end`))) day
)
group by grp             

if applied to sample data in your question - output is

enter image description here

Upvotes: 2

Related Questions