The Great
The Great

Reputation: 7723

Derive end date by shifting unique values using pandas

I have a dataframe like as shown below

df = pd.DataFrame({'subjectID' :[1,1,1,1,1,1,2,2,2],'start_date': 
  ['10/30/2006 08:04','10/30/2006 08:04','11/30/2006 11:01','12/30/2006 02:04','02/13/2007 07:24','04/20/2007 08:34',
   '02/20/2007 06:44','01/20/2007 03:44','11/09/2006 02:03']})

What I would like to do is

1) sort the dataframe by two columns (subject_id and start_date for each subject). Meaning for each subject the start date has to be arranged in ascending order. Like group by subject_id and sort their start_date values

2) Fill the end_date column for each patient by fetching the next unique value of start_date for the same patient.

3) As we follow step 2, we will encounter that last record of each subject will have no other values look beyond, so we just add 10 days to get the end date for last record for each subject

This is what I tried

df = df.sort_values(['subjectID', 'start_date'], ascending=[True,True]) # works but is this same as group by and sort after?
df['end_date'] = df.groupby('subjectID')['start_date'].unique().shift(-1) # this doesn't work

I expect my output to be like as shown below

enter image description here

Upvotes: 1

Views: 397

Answers (1)

jezrael
jezrael

Reputation: 862841

Use:

#converting to datetimes
df['start_date'] = pd.to_datetime(df['start_date'])

#sorting
df = df.sort_values(['subjectID', 'start_date'])
#created timestamp for each last value of group and added 10 days
s = df.drop_duplicates('subjectID', keep='last')['start_date'] + pd.Timedelta(10, unit='d')

#shifting without duplicates, added missing values by reindex and forward filling 
df['end_date'] = (df.drop_duplicates(['subjectID','start_date'])
                   .groupby('subjectID')['start_date']
                   .shift(-1)
                   .reindex(df.index)
                   .groupby(df['subjectID'])
                   .ffill()
                   )
#last set last values of groups
df['end_date'].update(s)
print (df)
  subjectID          start_date            end_date
0          1 2006-10-30 08:04:00 2006-11-30 11:01:00
1          1 2006-10-30 08:04:00 2006-11-30 11:01:00
2          1 2006-11-30 11:01:00 2006-12-30 02:04:00
3          1 2006-12-30 02:04:00 2007-02-13 07:24:00
4          1 2007-02-13 07:24:00 2007-04-20 08:34:00
5          1 2007-04-20 08:34:00 2007-04-30 08:34:00
8          2 2006-11-09 02:03:00 2007-01-20 03:44:00
7          2 2007-01-20 03:44:00 2007-02-20 06:44:00
6          2 2007-02-20 06:44:00 2007-03-02 06:44:00

Upvotes: 2

Related Questions