Reputation: 263
I'm hoping to group by user and find each user's first two uploads. I've figured out how to get the first date with a minimum, but I'm having trouble getting the second upload date. I would then like to get the average time between the two upload dates across all users.
df:
Date_Uploaded User_ID Display_Status
2018-10-27 abc123 Cleared
2018-10-28 abc123 Cleared
2018-10-29 abc123 Pending
2018-09-21 abc123 Pending
2018-08-24 efg123 Pending
2018-08-01 efg123 Pending
2018-07-25 efg123 Pending
Upvotes: 0
Views: 77
Reputation: 59549
Sort, calculate the difference, and then use groupby + nth(1) to get the difference between the first two uploads, if it exists (users with only one date will not show up).
import pandas as pd
df['Date_Uploaded'] = pd.to_datetime(df.Date_Uploaded)
df = df.sort_values(['User_ID', 'Date_Uploaded'])
df.Date_Uploaded.diff().groupby(df.User_ID).nth(1)
#User_ID
#abc123 36 days
#efg123 7 days
#Name: Date_Uploaded, dtype: timedelta64[ns]
If you just want the average then average that series:
df.Date_Uploaded.diff().groupby(df.User_ID).nth(1).mean()
#Timedelta('21 days 12:00:00')
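For reference, here is a self-contained sketch of the same idea on the question's sample data. It swaps in a per-group diff plus cumcount() for nth(1); that substitution is my own assumption, made because nth() in recent pandas (2.0 and later, as far as I know) keeps the original row labels instead of User_ID. The column name gap is made up for illustration.

```python
import pandas as pd

# Sample data copied from the question (Display_Status omitted)
df = pd.DataFrame({
    "Date_Uploaded": ["2018-10-27", "2018-10-28", "2018-10-29", "2018-09-21",
                      "2018-08-24", "2018-08-01", "2018-07-25"],
    "User_ID": ["abc123"] * 4 + ["efg123"] * 3,
})
df["Date_Uploaded"] = pd.to_datetime(df["Date_Uploaded"])
df = df.sort_values(["User_ID", "Date_Uploaded"])

# "gap" is a hypothetical column name: time since the same user's previous upload
df["gap"] = df.groupby("User_ID")["Date_Uploaded"].diff()

# cumcount() numbers each user's rows 0, 1, 2, ...; position 1 is the second upload
is_second = df.groupby("User_ID").cumcount() == 1
first_gap = df.loc[is_second].set_index("User_ID")["gap"]

average_gap = first_gap.mean()
print(first_gap)
print(average_gap)
```

Because the diff is taken inside each group, no gap is ever computed across two different users, and users with a single upload simply have no row at position 1.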
Upvotes: 0
Reputation: 3297
Since the other answers explain pretty well how to achieve this, I'll give you a one-liner for a change (it assumes Date_Uploaded has already been converted with pd.to_datetime):
In [1]: df.groupby('User_ID').apply(lambda g: g.sort_values('Date_Uploaded')['Date_Uploaded'][:2].diff()).mean()
Out[1]: Timedelta('21 days 12:00:00')
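If the one-liner is hard to read, the same logic can be unrolled into an explicit loop. This is only a sketch: the user xyz789 and its date are hypothetical, added to show what happens with a single upload, and the names gaps and per_user are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Date_Uploaded": pd.to_datetime([
        "2018-10-27", "2018-10-28", "2018-10-29", "2018-09-21",
        "2018-08-24", "2018-08-01", "2018-07-25",
        "2018-06-01",  # hypothetical extra row, not in the question
    ]),
    "User_ID": ["abc123"] * 4 + ["efg123"] * 3 + ["xyz789"],
})

gaps = {}
for user, dates in df.groupby("User_ID")["Date_Uploaded"]:
    # Two earliest uploads for this user
    first_two = dates.sort_values().iloc[:2]
    if len(first_two) == 2:  # skip users with a single upload
        gaps[user] = first_two.iloc[1] - first_two.iloc[0]

per_user = pd.Series(gaps)       # one Timedelta per user with 2+ uploads
average_gap = per_user.mean()
print(per_user)
print(average_gap)
```

The single-upload user drops out before averaging, which matches how the one-liner's NaT values are ignored by mean().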
Upvotes: 0
Reputation: 323276
Using sort_values + head:
df.sort_values('Date_Uploaded').groupby('User_ID').head(2)
Out[152]:
Date_Uploaded User_ID Display_Status
6 2018-07-25 efg123 Pending
5 2018-08-01 efg123 Pending
3 2018-09-21 abc123 Pending
0 2018-10-27 abc123 Cleared
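The output above stops at the first two rows per user. One possible way to carry it through to the average the question asks for is sketched below; it assumes Date_Uploaded is converted with pd.to_datetime first, and the names first_two and gaps are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Date_Uploaded": ["2018-10-27", "2018-10-28", "2018-10-29", "2018-09-21",
                      "2018-08-24", "2018-08-01", "2018-07-25"],
    "User_ID": ["abc123"] * 4 + ["efg123"] * 3,
    "Display_Status": ["Cleared", "Cleared", "Pending", "Pending",
                       "Pending", "Pending", "Pending"],
})
df["Date_Uploaded"] = pd.to_datetime(df["Date_Uploaded"])

# First two uploads per user, as in the answer
first_two = df.sort_values("Date_Uploaded").groupby("User_ID").head(2)

# Within each user, diff() is NaT on the first row and the gap on the second
gaps = first_two.groupby("User_ID")["Date_Uploaded"].diff()

# mean() skips NaT, so only the real gaps are averaged
average_gap = gaps.mean()
print(average_gap)
```

A user with only one upload would contribute a single NaT and so would not affect the average.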
Upvotes: 2