bogdanCsn
bogdanCsn

Reputation: 1325

Collapsing rows with overlapping dates

I'm trying to collapse the below dataframe into rows containing continuous time periods by id. Continuous means that, within the same id, the start_date is either less than, equal or at most one day greater than the end_date of the previous row (the data is already sorted by id, start_date and end_date). All rows that are continuous should be output as a single row, with start_date being the minimum start_date (i.e. the start_date of the first row in the continuous set) and end_date being the maximum end_date from the continuous set of rows.
Please see the desired output at the bottom.
The only way I can think of approaching this is by parsing the dataframe row by row, but I was wondering if there a more pythonic and elegant way to do it.

id  start_date  end_date
1   2017-01-01  2017-01-15
1   2017-01-12  2017-01-24
1   2017-01-25  2017-02-03
1   2017-02-05  2017-02-14
1   2017-02-16  2017-02-28
2   2017-01-01  2017-01-19
2   2017-01-24  2017-02-07
2   2017-02-07  2017-02-20

Here is the code to generate the input dataframe:

import numpy as np
import pandas as pd

start_date = ['2017-01-01','2017-01-12','2017-01-25','2017-02-05','2017-02-16',
              '2017-01-01','2017-01-24','2017-02-07']
end_date = ['2017-01-15','2017-01-24','2017-02-03','2017-02-14','2017-02-28',
            '2017-01-19','2017-02-07','2017-02-20']

df = pd.DataFrame({'id': [1,1,1,1,1,2,2,2],
                  'start_date': pd.to_datetime(start_date, format='%Y-%m-%d'),     
                  'end_date': pd.to_datetime(end_date, format='%Y-%m-%d')})

Desired output:

id  start_date  end_date
1   2017-01-01  2017-02-03
1   2017-02-05  2017-02-14
1   2017-02-16  2017-02-28
2   2017-01-01  2017-01-19
2   2017-01-24  2017-02-20

Upvotes: 2

Views: 455

Answers (1)

Allen Qin
Allen Qin

Reputation: 19947

def f(grp):
    #define a list to collect valid start and end ranges
    d=[]
    (
        #append a new row if the start date is at least 2 days greater than the last date from previous row,
        #otherwise update last rows's end date with current row's end date.
        grp.reset_index(drop=True)
           .apply(lambda x: d.append({x.start_date:x.end_date}) 
                            if x.name==0 or (x.start_date-pd.DateOffset(1))>grp.iloc[x.name-1].end_date 
                            else d[-1].update({list(d[-1].keys())[0]:x.end_date}),
                  axis=1)
    )
    #reconstruct a df using only valid start and end dates pairs.
    return pd.DataFrame([[list(e.keys())[0],list(e.values())[0]] for e in d],
                        columns=['start_date','end_date'])

df.groupby('id').apply(f).reset_index().drop('level_1',1)
Out[467]: 
   id start_date   end_date
0   1 2017-01-01 2017-02-03
1   1 2017-02-05 2017-02-14
2   1 2017-02-16 2017-02-28
3   2 2017-01-01 2017-01-19
4   2 2017-01-24 2017-02-20

Upvotes: 1

Related Questions