shsh
shsh

Reputation: 727

Groupby 'ID' With Timestamp Difference And Maintaining The Other Columns In Python

I want to make six columns ('ID', 'Earliest', 'Latest', 'Difference_hours', 'Difference_minutes', 'Serial') out of this partial dataframe:

ID    Timestamp        Serial
A     2019-01-01 00:01    F
A     2019-01-01 00:04    F
A     2019-01-01 00:06    F
B     2019-01-03 12:35    G
B     2019-01-01 13:35    G
.
.
.

The partial desired output is:

ID     Earliest            Latest            Difference_hours  Difference_minutes  Serial
A      2019-01-01 00:01    2019-01-01 00:06        0.083               5             F
A      2019-01-03 12:35    2019-01-04 13:35          1                 60            G
.
.
.

Column Description:

'Earliest':The earliest timestamp from the same 'ID',
'Latest': The latest from the same 'ID' ,
'Difference_hours': Timestamp difference between 'Earliest' and 'Latest' in hours,
'Difference_minutes': Timestamp difference between 'Earliest' and 'Latest' in minutes .

Note: 'Serial' is constant for every unique 'ID'

I know I need to use groupby and maybe with transform, but don't know how to do it exactly.

Upvotes: 0

Views: 143

Answers (1)

BhishanPoudel
BhishanPoudel

Reputation: 17164

This should work:

Question

df = pd.DataFrame({'ID': ['A', 'A', 'A', 'B', 'B'],
          'Timestamp': ['2019-01-01 00:01', '2019-01-01 00:04', 
                        '2019-01-01 00:06', '2019-01-03 12:35', '2019-01-04 13:35'],
          'Serial': ['F', 'F', 'F', 'G', 'G']})
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
print(df)
  ID           Timestamp Serial
0  A 2019-01-01 00:01:00      F
1  A 2019-01-01 00:04:00      F
2  A 2019-01-01 00:06:00      F
3  B 2019-01-03 12:35:00      G
4  B 2019-01-04 13:35:00      G

Answer

ans = df.groupby(['ID','Serial'])['Timestamp'].agg(['min','max']).reset_index()
ans['hr_diff'] = ans['max'].dt.hour - ans['min'].dt.hour
ans['min_diff'] = ans['max'].dt.minute - ans['min'].dt.minute

print(ans)
  ID Serial                 min                 max  hr_diff  min_diff
0  A      F 2019-01-01 00:01:00 2019-01-01 00:06:00        0         5
1  B      G 2019-01-03 12:35:00 2019-01-04 13:35:00        1         0

If you have pandas > 0.25

ans = df.groupby(['ID','Serial'])['Timestamp'].agg(min='min',
                                                   max='max',
                                                   diff_hr=lambda x: x.max().hour - x.min().hour,
                                                   diff_min=lambda x: x.max().minute - x.min().minute
                                                   )
ans.reset_index()
  ID Serial                 min                 max  diff_hr  diff_min
0  A      F 2019-01-01 00:01:00 2019-01-01 00:06:00        0         5
1  B      G 2019-01-03 12:35:00 2019-01-04 13:35:00        1         0

Upvotes: 1

Related Questions