Brian Stork

Reputation: 980

SQL Server Compressing Adjacent Date Ranges

I have a table that holds a person id and a date range (a start date and a stop date). Each person may have multiple rows with multiple start and stop dates.

create table #DateRanges (
   tableID   int not null,
   personID  int not null,
   startDate date,
   endDate   date
);
insert #DateRanges (tableID, personID, startDate, endDate)
values (1, 100, '2011-01-01', '2011-01-31') -- Just January
     , (2, 100, '2011-02-01', '2011-02-28') -- Just February
     , (3, 100, '2011-04-01', '2011-04-30') -- April - Skipped March
     , (4, 100, '2011-05-01', '2011-05-31') -- May
     , (5, 100, '2011-06-01', '2011-12-31') -- June through December

I need a way to collapse adjacent date ranges, where the end date of the previous row is exactly one day before the start date of the next row. All contiguous ranges must be merged into one, splitting only where the end-to-start gap is greater than one day. The above data needs to be compressed to:

+-----------+----------+--------------+------------+
| SomeNewID | PersonID | NewStartDate | NewEndDate |
+-----------+----------+--------------+------------+
|        1  |     100  |   2011-01-01 | 2011-02-28 |
+-----------+----------+--------------+------------+
|        2  |     100  |   2011-04-01 | 2011-12-31 |
+-----------+----------+--------------+------------+

Just two rows, because the only missing range is March. Now, if all of March were present, either as one row or as many rows, the compression would result in just one row. But if only two days in the middle of March were present, we would get a third row covering just those March dates.

I've been working with the LEAD and LAG functions in SQL Server 2016 to try to get this done as a set-based operation, but so far I've come up empty. I would like to do it without a loop and RBAR processing, but I'm not seeing a solution.

Upvotes: 1

Views: 806

Answers (3)

andresch

Reputation: 101

While this question is already quite dated, I believe the problem is still worth an answer.

If your data guarantees that the ranges don't overlap, then you can compress them by applying three steps:

  1. Mark the rows that start a new range
  2. Assign all rows to their contiguous sequence of ranges
  3. Group the entries of each sequence
  • For (1) you use the lag() window function (ordered by startDate, plus any partition criteria) to mark all rows whose range does not directly continue the previous row's range.
  • For (2) you use count() as a window function to count how many new sequences of ranges have started at or before the current row, and use that count as the sequence id. The trick is to count only the rows flagged as range starts.
  • For (3) you then group by the sequence id and the partition criteria.

Here's a code example for SQLite:

with RangesWithStart as (
  select
    *
    -- flag rows whose range does not start exactly one day after the
    -- previous range's end (true for the first row, where lag() is null)
    , coalesce(
        date(
          lag(endDate) over (
            partition by personID
            order by startDate
          )
          , '+1 day'
        ) <> startDate
        , true
      ) as isRangeStart
  from
    DateRanges
)
, RangesWithSequences as (
  select
    *
    -- running count of range starts so far; nullif() makes count()
    -- skip the rows where isRangeStart is false
    , count(nullif(isRangeStart, false)) over (
        partition by personID
        order by startDate
        rows unbounded preceding
      ) as rangeSequenceID
  from
    RangesWithStart
)
select
  rangeSequenceID as newID
  , personID
  , min(startDate) as startDate
  , max(endDate) as endDate
from RangesWithSequences
group by 1, 2
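
The same three steps carry over to SQL Server. Here is a minimal T-SQL sketch against the #DateRanges table from the question (my translation, not tested by the original author): T-SQL has no boolean values and no ordinal group by, so the start flag becomes a 0/1 column accumulated with sum().

with RangesWithStart as (
  select
    *
    -- 1 when this row does not start exactly one day after the
    -- previous row's end; the first row per person has no previous
    -- row (lag() returns null), so it also starts a sequence
    , case
        when dateadd(day, 1, lag(endDate) over (
               partition by personID
               order by startDate
             )) = startDate then 0
        else 1
      end as isRangeStart
  from #DateRanges
)
, RangesWithSequences as (
  select
    *
    -- running total of start flags = id of the contiguous sequence
    , sum(isRangeStart) over (
        partition by personID
        order by startDate
        rows unbounded preceding
      ) as rangeSequenceID
  from RangesWithStart
)
select
  rangeSequenceID as newID
  , personID
  , min(startDate) as startDate
  , max(endDate) as endDate
from RangesWithSequences
group by rangeSequenceID, personID;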

Upvotes: 0

Brian Stork

Reputation: 980

After working on this for a few days, I think I have a solution, and I wanted to share it in case anyone else needs something similar. I used a few CTEs to compute lead, lag, and gap values, distill the rows down to only the significant start and stop dates, and then use lead again to pair up the compressed start and stop dates. There may be an easier way, but I think this handles the day-level resolution nicely.

with LeadAndLagAndGap as (
   select
      tableid,
      personID,
      startDate,
      endDate,
      -- days skipped between the previous row's end and this row's start
      -- (-1 when there is no previous row)
      coalesce(datediff(day, lag(endDate) over (partition by personID order by startDate), startDate) - 1, -1) as gapBefore,
      -- days skipped between this row's end and the next row's start
      -- (-1 when there is no next row)
      coalesce(datediff(day, endDate, lead(startDate) over (partition by personID order by startDate)) - 1, -1) as gapAfter
   from
      #DateRanges
), OnlyStartAndEndRows as (
   select
      tableid,
      personID,
      startDate,
      endDate,
      gapBefore,
      gapAfter
   from
      LeadAndLagAndGap
   where
      gapBefore <> 0    -- Definitely FIRST record of a compressed range (first row, or a gap right before it)
      or gapAfter <> 0  -- Definitely LAST record of a compressed range (last row, or a gap right after it)
), PreCollapseReaggregate as (
   select
      tableid,
      personID,
      startDate,
      endDate,
      case when gapBefore <> 0 then startDate end as DefiniteStart,
      case when gapAfter <> 0 then endDate end as DefiniteEnd
   from
      OnlyStartAndEndRows
), Collapsed as (
   select
      tableid,
      personID,
      DefiniteStart as startDate,
      -- a row that both starts and ends a range keeps its own end date;
      -- otherwise the range's end sits on the next boundary row
      coalesce(DefiniteEnd, lead(DefiniteEnd) over (partition by personID order by startDate)) as endDate
     from PreCollapseReaggregate
)
select * from Collapsed
where startdate is not null
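
As a quick check of the mid-March edge case described in the question, one can add, for example, an isolated two-day row before running the query; the expected result is a third range covering just those two days:

insert #DateRanges (tableID, personID, startDate, endDate)
values (6, 100, '2011-03-10', '2011-03-11'); -- two isolated days in March

-- Expected compressed output:
--   2011-01-01 .. 2011-02-28
--   2011-03-10 .. 2011-03-11
--   2011-04-01 .. 2011-12-31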

Upvotes: 0

Kannan Kandasamy

Reputation: 13969

You can use lag to assign each row to the correct bucket and then do the group-by, as below:

;with cte1 as (
    select *,
           -- days from the previous row's end to this row's start
           dtdiff = datediff(day, lag(enddate) over (partition by personid order by startdate), startdate)
    from #DateRanges
),
cte2 as (
    select *,
           -- running count of range starts creates the bucket for min/max;
           -- a gap greater than one day (or no previous row) opens a new bucket
           grp = sum(case when dtdiff is null or dtdiff > 1 then 1 else 0 end)
                     over (partition by personid order by startdate rows unbounded preceding)
    from cte1
)
select SomeNewId    = row_number() over (order by Personid, min(startdate)),
       Personid,
       NewStartDate = min(startdate),
       NewEndDate   = max(enddate)
from cte2
group by PersonId, grp

Your output:

+-----------+----------+--------------+------------+
| SomeNewId | Personid | NewStartDate | NewEndDate |
+-----------+----------+--------------+------------+
|         1 |      100 | 2011-01-01   | 2011-02-28 |
|         2 |      100 | 2011-04-01   | 2011-12-31 |
+-----------+----------+--------------+------------+

My testing input:

insert #DateRanges (tableID, personID, startDate, endDate)
values (1, 100, '2011-01-01', '2011-01-31') -- Just January
     , (2, 100, '2011-02-01', '2011-02-28') -- Just February
     , (3, 100, '2011-04-01', '2011-04-30') -- April - Skipped March
     , (4, 100, '2011-05-01', '2011-05-31') -- May
     , (5, 100, '2011-06-01', '2011-06-30') -- June
     , (6, 100, '2011-07-01', '2011-07-31') -- July
     , (7, 100, '2011-08-01', '2011-08-31') -- August
     , (8, 100, '2011-10-01', '2011-10-31') -- October - Skipped September
     , (9, 100, '2011-11-01', '2011-11-30') -- November

Output for testing data:

+-----------+----------+--------------+------------+
| SomeNewId | Personid | NewStartDate | NewEndDate |
+-----------+----------+--------------+------------+
|         1 |      100 | 2011-01-01   | 2011-02-28 |
|         2 |      100 | 2011-04-01   | 2011-08-31 |
|         3 |      100 | 2011-10-01   | 2011-11-30 |
+-----------+----------+--------------+------------+

Upvotes: 0
