The Great
The Great

Reputation: 7743

Use pandas groupby to fetch frequency count based on time intervals

I have a dataframe like as shown below

df = pd.DataFrame({'subject_id':[1,1,1,2,2,2],
              'start_time':['2130-03-25 18:51:47','2130-04-23 18:51:47','2130-04-23 18:51:47','2120-01-11 18:51:47','2120-01-11 18:51:47','2120-04-28 18:51:47'],
              'test_time':['2130-03-26 14:51:47','2130-04-24 18:51:47','2130-04-25 18:51:47','2121-02-26 18:51:47','2121-02-26 18:51:47','2120-04-28 19:51:47'],
              'test':['test1','test2','test2','test2','test3','test3']})
df['start_time'] = pd.to_datetime(df['start_time'])
df['test_time'] = pd.to_datetime(df['test_time'])

What I would like to do is

a) Get the number of tests done for each subject every 24 hours from start_time. The test time can be found from the test_time column

Example - By 24hours, I mean 0-24hours, 24-48hours, 48-72hours etc.

I tried the below

df['time_diff'] = (df.test_time - df.start_time) / pd.Timedelta(hours=1)
conditions = [
    (df['time_diff'] >= 0) & (df['time_diff'] <= 24),
    (df['time_diff'] >24 ) & (df['time_diff'] <= 48),
    (df['time_diff'] > 48) & (df['time_diff'] <= 72)]
choices = ['0-24hrs','24-48hrs','48-72hrs']
df['op'] = np.select(conditions, choices, default='Greater than 3 days')
df.groupby(['subject_id','test','op'])['test'].count()

However the above produces an output in incorrect format.

I expect my output to be like as shown below

enter image description here

Upvotes: 0

Views: 127

Answers (1)

BENY
BENY

Reputation: 323396

You can just adding unstack

out = df.groupby(['subject_id','test','op'])['test'].count().unstack(fill_value=0).reset_index()
out
op  subject_id   test  0-24hrs  24-48hrs  Greater than 3 days
0            1  test1        1         0                    0
1            1  test2        1         1                    0
2            2  test2        0         0                    1
3            2  test3        1         0                    1

Upvotes: 1

Related Questions