Anton
Anton

Reputation: 1057

Date difference for dataframe column

I have following dataframe

+-----------------------------+
|           app_id    mail_dt |
+-----------------------------+
| 100255  12000017 2009-03-23 |
| 129628  12000017 2009-09-04 |
| 183069  12000017 2010-03-19 |
| 125231  12000031 2009-08-18 |
| 204245  12000031 2010-05-12 |
| 296410  12000031 2010-10-06 |
| 183030  12000044 2010-03-17 |
| 291704  12000044 2010-09-29 |
| 635932  12000044 2011-09-23 |
+-----------------------------+

I need to calculate consecutive datetime diff for same app_id and then calculate mean value for corresponding mail_dt difference

Say for app_id = 12000017 it will be the following

df.loc[129628,'mail_dt'] - df.loc[100255,'mail_dt']
Out[25]: Timedelta('165 days 00:00:00')

df.loc[183069,'mail_dt'] - df.loc[129628,'mail_dt']
Out[26]: Timedelta('196 days 00:00:00')

And the mean value will be 180.5 days (let's floor it to 180)

So, my question is how to calculate this mean value for each app_id and construct following dataframe

+----------+------------+
|  app_id  | mean_delta |
+----------+------------+
| 12000017 |        180 |
| 12000031 |        207 |
+----------+------------+

Of course, one can iterate through all app_id and calculate mean timedelta values but i wonder if there is a beautiful solution based on groupby and other functions. Thanks in advance

Upvotes: 2

Views: 43

Answers (2)

Jaroslav Bezděk
Jaroslav Bezděk

Reputation: 7635

Another option is to use pandas.DataFrame.groupby() in combination with pandas.Series.shift(). The code is following:

>>> df_agg = df.groupby(['app_id'])['mail_dt']\
...     .agg(mean_delta=lambda x: (pd.to_datetime(x) - pd.to_datetime(x.shift())).mean().days)\
...     .reset_index()
>>> print(df_agg)
     app_id  mean_delta
0  12000017         180
1  12000031         207
2  12000044         277

Upvotes: 0

jezrael
jezrael

Reputation: 863681

Idea is get differences per groups by GroupBy.agg with Series.diff and mean, convert timedeltas to days by Series.dt.days and last to 2 columns DataFrame by Series.reset_index:

#if necessary
df['mail_dt'] = pd.to_datetime(df['mail_dt'])

df1 = (df.groupby('app_id')['mail_dt']
         .agg(lambda x: x.diff().mean())
         .dt.days
         .reset_index(name='avg'))
print (df1)
     app_id  avg
0  12000017  180
1  12000031  207
2  12000044  277

Upvotes: 2

Related Questions