DForck42
DForck42

Reputation: 20357

Collapse consecutive similar records into a single record

I have records of people from an old system that I'm trying to convert over to the new system. In the old system, a person might end up with several records for the same location. They could also go from location, to another, and then return to the previous location. Here's some example data:

PersonID  | LocationID | StartDate  | EndDate
1         | 1          | 1980-07-30 | 2007-07-16
1         | 1          | 2007-07-16 | 2008-01-30
1         | 2          | 2008-01-30 | 2009-03-02
1         | 2          | 2009-03-02 | 2009-11-06
1         | 3          | 2014-07-16 | 2015-01-16
1         | 1          | 2016-01-26 | 2999-12-31

I would like to collapse this data so that I get a date range for any consecutive LocationIDs. For the data above, this is what I would expect:

PersonID  | LocationID | StartDate  | EndDate
1         | 1          | 1980-07-30 | 2008-01-30
1         | 2          | 2008-01-30 | 2009-11-06
1         | 3          | 2014-07-16 | 2015-01-16
1         | 1          | 2016-01-26 | 2999-12-31

I'm unsure as to how to do this. I previously tried joining to the previous record, but that only works when there's two consecutive locations, not with 3 or more (there could be an undefined number of consecutive records).

select
    a.PersonID,
    a.LocationID,
    a.StartDate,
    a.EndDate,
    case when a.LocationID = b.LocationID then a.PK_ID else b.PK_ID end as NewID
from employees a
left outer join employees b
on a.PersonID = b.PersonID
and a.PK_ID = b.PK_ID - 1

So, how can I write a query to get the results I need?

Note: we're treating '2999-12-31' are our 'NULL' date field

Upvotes: 0

Views: 145

Answers (2)

John Cappelletti
John Cappelletti

Reputation: 81970

This is a classic Gaps-and-Islands (Edit- corrected for larger span 2999)

Select [PersonID]
      ,[LocationID]
      ,[StartDate]  = min(D)
      ,[EndDate]    = max(D)
 From (
        Select *
              ,Grp = Row_Number() over (Order By D) - Row_Number() over (Partition By [PersonID],[LocationID] Order By D) 
         from YourTable A
         Cross Apply (
                        Select Top (DateDiff(DAY,A.[StartDate],A.[EndDate])+1) D=DateAdd(DAY,-1+Row_Number() Over (Order By (Select Null)),A.[StartDate])  
                        From  master..spt_values n1,master..spt_values n2
                     ) B
      ) G
 Group By [PersonID],[LocationID],Grp
 Order By [PersonID],min(D)

Returns

PersonID    LocationID  StartDate   EndDate
1           1           1980-07-30  2008-01-30
1           2           2008-01-30  2009-11-06
1           3           2014-07-16  2015-01-16
1           1           2016-01-26  2999-12-31

Using your original query

Select [PersonID]
      ,[LocationID]
      ,[StartDate]  = min(D)
      ,[EndDate]    = max(D)
 From (
        Select *
              ,Grp = Row_Number() over (Order By D) - Row_Number() over (Partition By [PersonID],[LocationID] Order By D) 
         From (
                -- Your Original Query
                select
                    a.PersonID,
                    a.LocationID,
                    a.StartDate,
                    a.EndDate,
                    case when a.LocationID = b.LocationID then a.PK_ID else b.PK_ID end as NewID
                from employees a
                left outer join employees b
                on a.PersonID = b.PersonID
                and a.PK_ID = b.PK_ID - 1
              ) A
         Cross Apply (
                        Select Top (DateDiff(DAY,A.[StartDate],A.[EndDate])+1) D=DateAdd(DAY,-1+Row_Number() Over (Order By (Select Null)),A.[StartDate])  
                        From  master..spt_values n1,master..spt_values n2
                     ) B
      ) G
 Group By [PersonID],[LocationID],Grp
 Order By [PersonID],min(D)

Requested Comments

Let's break it down to its components.

1) The CROSS APPLY Portion: This will expand a single record into N records. For example:

Declare @YourTable Table ([PersonID] int,[LocationID] int,[StartDate] date,[EndDate] date)
Insert Into @YourTable Values
 (1,1,'1980-07-01','1980-07-03' )
,(1,1,'1980-07-02','1980-07-04' )  -- Notice the Overlap
,(1,2,'2008-01-30','2008-02-05')

Select *
    from @YourTable A
    Cross Apply (
                Select Top (DateDiff(DAY,A.[StartDate],A.[EndDate])+1) D=DateAdd(DAY,-1+Row_Number() Over (Order By (Select Null)),A.[StartDate])  
                From  master..spt_values n1,master..spt_values n2
                ) B

The above query will generate

enter image description here

2) The Grp Portion: Perhaps easier if I provide a simple example:

Declare @YourTable Table ([PersonID] int,[LocationID] int,[StartDate] date,[EndDate] date)
Insert Into @YourTable Values
 (1,1,'1980-07-01','1980-07-03' )
,(1,1,'1980-07-02','1980-07-04' )  -- Notice the Overlap
,(1,2,'2008-01-30','2008-02-05')

Select *
      ,Grp = Row_Number() over (Order By D) - Row_Number() over (Partition By [PersonID],[LocationID] Order By D) 
      ,RN1 = Row_Number() over (Order By D)
      ,RN2 = Row_Number() over (Partition By [PersonID],[LocationID] Order By D) 
    from @YourTable A
    Cross Apply (
                Select Top (DateDiff(DAY,A.[StartDate],A.[EndDate])+1) D=DateAdd(DAY,-1+Row_Number() Over (Order By (Select Null)),A.[StartDate])  
                From  master..spt_values n1,master..spt_values n2
                ) B

The above query Generates:

enter image description here

RN1 and RN2 are breakouts of the GRP, just to illustrate the mechanic. Notice RN1 minus RN2 equals the GRP. Once we have the GRP, it becomes a simple matter of aggregation via a group by

3) Pulling it all Together:

Declare @YourTable Table ([PersonID] int,[LocationID] int,[StartDate] date,[EndDate] date)
Insert Into @YourTable Values
 (1,1,'1980-07-01','1980-07-03' )
,(1,1,'1980-07-02','1980-07-04' )  -- Notice the Overlap
,(1,2,'2008-01-30','2008-02-05')

Select [PersonID]
      ,[LocationID]
      ,[StartDate]  = min(D)
      ,[EndDate]    = max(D)
 From (
        Select *
              ,Grp = Row_Number() over (Order By D) - Row_Number() over (Partition By [PersonID],[LocationID] Order By D) 
            from @YourTable A
            Cross Apply (
                        Select Top (DateDiff(DAY,A.[StartDate],A.[EndDate])+1) D=DateAdd(DAY,-1+Row_Number() Over (Order By (Select Null)),A.[StartDate])  
                        From  master..spt_values n1,master..spt_values n2
                        ) B
      ) G
 Group By [PersonID],[LocationID],Grp
 Order By [PersonID],min(D)

Returns

enter image description here

Upvotes: 1

Gordon Linoff
Gordon Linoff

Reputation: 1269923

For your sample data, you can use the difference of row numbers approach:

select personid, locationid, min(startdate), max(enddate)
from (select e.*,
             row_number() over (partition by personid order by startdate) as seqnum_p,
             row_number() over (partition by personid, locationid order by startdate) as seqnum_pl
      from employees  e
     ) e
group by (seqnum_p - seqnum_pl), personid, locationid;

This assumes that the start and end dates are contiguous. That is, there is no gap for a given employee at the same location.

Upvotes: 1

Related Questions