Reputation: 7743
I have a dataframe like as shown below
df = pd.DataFrame({'subject_id':[1,1,1,2,2,2],
'start_time':['2130-03-25 18:51:47','2130-04-23 18:51:47','2130-04-23 18:51:47','2120-01-11 18:51:47','2120-01-11 18:51:47','2120-04-28 18:51:47'],
'test_time':['2130-03-26 14:51:47','2130-04-24 18:51:47','2130-04-25 18:51:47','2121-02-26 18:51:47','2121-02-26 18:51:47','2120-04-28 19:51:47'],
'test':['test1','test2','test2','test2','test3','test3']})
df['start_time'] = pd.to_datetime(df['start_time'])
df['test_time'] = pd.to_datetime(df['test_time'])
What I would like to do is
a) Get the number of tests done for each subject every 24 hours from start_time
. The test time can be found from the test_time
column
Example - By 24hours, I mean 0-24hours
, 24-48hours
, 48-72hours
etc.
I tried the below
df['time_diff'] = (df.test_time - df.start_time) / pd.Timedelta(hours=1)
conditions = [
(df['time_diff'] >= 0) & (df['time_diff'] <= 24),
(df['time_diff'] >24 ) & (df['time_diff'] <= 48),
(df['time_diff'] > 48) & (df['time_diff'] <= 72)]
choices = ['0-24hrs','24-48hrs','48-72hrs']
df['op'] = np.select(conditions, choices, default='Greater than 3 days')
df.groupby(['subject_id','test','op'])['test'].count()
However the above produces an output in incorrect format.
I expect my output to be like as shown below
Upvotes: 0
Views: 127
Reputation: 323396
You can just adding unstack
out = df.groupby(['subject_id','test','op'])['test'].count().unstack(fill_value=0).reset_index()
out
op subject_id test 0-24hrs 24-48hrs Greater than 3 days
0 1 test1 1 0 0
1 1 test2 1 1 0
2 2 test2 0 0 1
3 2 test3 1 0 1
Upvotes: 1