Reputation: 1057
I have following dataframe
+-----------------------------+
| app_id mail_dt |
+-----------------------------+
| 100255 12000017 2009-03-23 |
| 129628 12000017 2009-09-04 |
| 183069 12000017 2010-03-19 |
| 125231 12000031 2009-08-18 |
| 204245 12000031 2010-05-12 |
| 296410 12000031 2010-10-06 |
| 183030 12000044 2010-03-17 |
| 291704 12000044 2010-09-29 |
| 635932 12000044 2011-09-23 |
+-----------------------------+
I need to calculate consecutive datetime diff for same app_id
and then calculate mean value for corresponding mail_dt
difference
Say for app_id = 12000017 it will be the following
df.loc[129628,'mail_dt'] - df.loc[100255,'mail_dt']
Out[25]: Timedelta('165 days 00:00:00')
df.loc[183069,'mail_dt'] - df.loc[129628,'mail_dt']
Out[26]: Timedelta('196 days 00:00:00')
And the mean value will be 180.5 days (let's floor it to 180)
So, my question is how to calculate this mean value for each app_id
and construct following dataframe
+----------+------------+
| app_id | mean_delta |
+----------+------------+
| 12000017 | 180 |
| 12000031 | 207 |
+----------+------------+
Of course, one can iterate through all app_id
and calculate mean timedelta values but i wonder if there is a beautiful solution based on groupby and other functions. Thanks in advance
Upvotes: 2
Views: 43
Reputation: 7635
Another option is to use pandas.DataFrame.groupby()
in combination with pandas.Series.shift()
. The code is following:
>>> df_agg = df.groupby(['app_id'])['mail_dt']\
... .agg(mean_delta=lambda x: (pd.to_datetime(x) - pd.to_datetime(x.shift())).mean().days)\
... .reset_index()
>>> print(df_agg)
app_id mean_delta
0 12000017 180
1 12000031 207
2 12000044 277
Upvotes: 0
Reputation: 863681
Idea is get differences per groups by GroupBy.agg
with Series.diff
and mean
, convert timedeltas to days by Series.dt.days
and last to 2 columns DataFrame by Series.reset_index
:
#if necessary
df['mail_dt'] = pd.to_datetime(df['mail_dt'])
df1 = (df.groupby('app_id')['mail_dt']
.agg(lambda x: x.diff().mean())
.dt.days
.reset_index(name='avg'))
print (df1)
app_id avg
0 12000017 180
1 12000031 207
2 12000044 277
Upvotes: 2