Reputation: 7723
I have a dataframe like as shown below
df = pd.DataFrame({'subjectID' :[1,1,1,1,1,1,2,2,2],'start_date':
['10/30/2006 08:04','10/30/2006 08:04','11/30/2006 11:01','12/30/2006 02:04','02/13/2007 07:24','04/20/2007 08:34',
'02/20/2007 06:44','01/20/2007 03:44','11/09/2006 02:03']})
What I would like to do is
1) sort the dataframe by two columns (subject_id and start_date for each subject). Meaning for each subject the start date has to be arranged in ascending order. Like group by subject_id and sort their start_date values
2) Fill the end_date column for each patient by fetching the next unique value of start_date for the same patient.
3) As we follow step 2, we will encounter that last record of each subject will have no other values look beyond, so we just add 10 days to get the end date for last record for each subject
This is what I tried
df = df.sort_values(['subjectID', 'start_date'], ascending=[True,True]) # works but is this same as group by and sort after?
df['end_date'] = df.groupby('subjectID')['start_date'].unique().shift(-1) # this doesn't work
I expect my output to be like as shown below
Upvotes: 1
Views: 397
Reputation: 862841
Use:
#converting to datetimes
df['start_date'] = pd.to_datetime(df['start_date'])
#sorting
df = df.sort_values(['subjectID', 'start_date'])
#created timestamp for each last value of group and added 10 days
s = df.drop_duplicates('subjectID', keep='last')['start_date'] + pd.Timedelta(10, unit='d')
#shifting without duplicates, added missing values by reindex and forward filling
df['end_date'] = (df.drop_duplicates(['subjectID','start_date'])
.groupby('subjectID')['start_date']
.shift(-1)
.reindex(df.index)
.groupby(df['subjectID'])
.ffill()
)
#last set last values of groups
df['end_date'].update(s)
print (df)
subjectID start_date end_date
0 1 2006-10-30 08:04:00 2006-11-30 11:01:00
1 1 2006-10-30 08:04:00 2006-11-30 11:01:00
2 1 2006-11-30 11:01:00 2006-12-30 02:04:00
3 1 2006-12-30 02:04:00 2007-02-13 07:24:00
4 1 2007-02-13 07:24:00 2007-04-20 08:34:00
5 1 2007-04-20 08:34:00 2007-04-30 08:34:00
8 2 2006-11-09 02:03:00 2007-01-20 03:44:00
7 2 2007-01-20 03:44:00 2007-02-20 06:44:00
6 2 2007-02-20 06:44:00 2007-03-02 06:44:00
Upvotes: 2