constiii
constiii

Reputation: 648

Calculating average differences with groupby in Python

I'm new to Python and I want to aggregate (groupby) ID's in my first column. The values in the second column are timestamps (datetime format) and by aggregating the ID's, I want the to get the average difference between the timestamps (in days) per ID in the aggregated ID column. My table looks like df1 and I want something like df2, but since I'm an absolute beginner, I have no idea how to do this.

import pandas as pd
import numpy as np
from datetime import datetime

In[1]:
# df1
ID = np.array([1,1,1,2,2,3])
Timestamp = np.array([
datetime.strptime('2018-01-01 18:07:02', "%Y-%m-%d %H:%M:%S"),
datetime.strptime('2018-01-08 18:07:02', "%Y-%m-%d %H:%M:%S"),
datetime.strptime('2018-03-15 18:07:02', "%Y-%m-%d %H:%M:%S"),
datetime.strptime('2018-01-01 18:07:02', "%Y-%m-%d %H:%M:%S"),
datetime.strptime('2018-02-01 18:07:02', "%Y-%m-%d %H:%M:%S"),
datetime.strptime('2018-01-01 18:07:02', "%Y-%m-%d %H:%M:%S")])
df = pd.DataFrame({'ID': ID, 'Timestamp': Timestamp})

Out[1]:   
    ID  Timestamp
0   1   2018-01-01 18:07:02
1   1   2018-01-08 18:07:02
2   1   2018-03-15 18:07:02
3   2   2018-01-01 18:07:02
4   2   2018-02-01 18:07:02
5   3   2018-01-01 18:07:02

In[2]:
#df2
ID = np.array([1,2,3])
    Avg_Difference = np.array([7, 1, "nan"])
    df2 = pd.DataFrame({'ID': ID, 'Avg_Difference': Avg_Difference})

Out[2]:
ID  Avg_Difference
   0    1   7 
   1    2   1 
   2    3   nan

Upvotes: 5

Views: 1811

Answers (1)

sacuL
sacuL

Reputation: 51425

You could do something like this:

df.groupby('ID')['Timestamp'].apply(lambda x: x.diff().mean())

In your case, it looks like:

>>> df
   ID           Timestamp
0   1 2018-01-01 18:07:02
1   1 2018-01-08 18:07:02
2   1 2018-03-15 18:07:02
3   2 2018-01-01 18:07:02
4   2 2018-02-01 18:07:02
5   3 2018-01-01 18:07:02

>>> df.groupby('ID')['Timestamp'].apply(lambda x: x.diff().mean())
ID
1   36 days 12:00:00
2   31 days 00:00:00
3                NaT
Name: Timestamp, dtype: timedelta64[ns]

If you want it as a dataframe with the column named Avg_Difference, just add to_frame at the end:

df.groupby('ID')['Timestamp'].apply(lambda x: x.diff().mean()).to_frame('Avg_Difference')

     Avg_Difference
ID                 
1  36 days 12:00:00
2  31 days 00:00:00
3               NaT

Edit Based on your comment, if you want to remove the time element, and just get the number of days, you can do the following:

df.groupby('ID')['Timestamp'].apply(lambda x: x.diff().mean()).dt.days.to_frame('Avg_Difference')

    Avg_Difference
ID                
1             36.0
2             31.0
3              NaN

Upvotes: 5

Related Questions