KirklandShawty

Reputation: 263

Groupby first two earliest dates, then average time between first two dates - pandas

I want to group by user and find each user's first two uploads. I've figured out how to get the first date with a minimum, but I'm having trouble getting the second upload date. I'd then like to get the average time between those two upload dates across all users.

df:

Date_Uploaded  User_ID  Display_Status
2018-10-27     abc123   Cleared
2018-10-28     abc123   Cleared
2018-10-29     abc123   Pending
2018-09-21     abc123   Pending
2018-08-24     efg123   Pending
2018-08-01     efg123   Pending
2018-07-25     efg123   Pending

Upvotes: 0

Views: 77

Answers (3)

ALollz

Reputation: 59549

Sort, calculate the difference between consecutive dates, and then use groupby + nth(1) to get the difference between each user's first two uploads, if it exists (users with only 1 date will not show up).

import pandas as pd

df['Date_Uploaded'] = pd.to_datetime(df.Date_Uploaded)
df = df.sort_values(['User_ID', 'Date_Uploaded'])

df.Date_Uploaded.diff().groupby(df.User_ID).nth(1)

#User_ID
#abc123   36 days
#efg123    7 days
#Name: Date_Uploaded, dtype: timedelta64[ns]

If you just want the average, take the mean of that series:

df.Date_Uploaded.diff().groupby(df.User_ID).nth(1).mean()
#Timedelta('21 days 12:00:00')
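
For anyone who wants to run this end to end, here is a minimal self-contained sketch on the sample data from the question. It is a variant of the code above, not the same calls: it takes the diff inside each group and picks the second row with cumcount() instead of nth(1), because as far as I know the index that nth returns has differed between pandas versions. The names gaps, position, first_gap and average_gap are just illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Date_Uploaded": ["2018-10-27", "2018-10-28", "2018-10-29", "2018-09-21",
                      "2018-08-24", "2018-08-01", "2018-07-25"],
    "User_ID": ["abc123"] * 4 + ["efg123"] * 3,
    "Display_Status": ["Cleared", "Cleared", "Pending", "Pending",
                       "Pending", "Pending", "Pending"],
})
df["Date_Uploaded"] = pd.to_datetime(df["Date_Uploaded"])
df = df.sort_values(["User_ID", "Date_Uploaded"])

# Gap to the previous upload of the same user (NaT on each user's first row).
gaps = df.groupby("User_ID")["Date_Uploaded"].diff()

# 0-based position of each row within its user's sorted uploads.
position = df.groupby("User_ID").cumcount()

# Keep only each user's second upload; users with one upload drop out here.
first_gap = df.assign(gap=gaps)[position == 1].set_index("User_ID")["gap"]

average_gap = first_gap.mean()
print(first_gap)    # abc123: 36 days, efg123: 7 days
print(average_gap)  # 21 days 12:00:00
```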

Upvotes: 0

paulo.filip3

Reputation: 3297

Since the other answers explain pretty well how to achieve this, I'll give you a one-liner for a change (it assumes Date_Uploaded has already been converted with pd.to_datetime):

 In [1]: df.groupby('User_ID').apply(lambda g: g.sort_values('Date_Uploaded')['Date_Uploaded'][:2].diff()).mean()
 Out[1]: Timedelta('21 days 12:00:00')
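
If the one-liner is hard to read, the same idea can be unpacked into named steps. This is only a sketch on the question's sample data: the helper first_gap is a hypothetical name, and it loops over the groups explicitly instead of calling apply, so it is not a drop-in equivalent of the line above.

```python
import pandas as pd

# Sample data copied from the question (only the columns needed here).
df = pd.DataFrame({
    "Date_Uploaded": pd.to_datetime([
        "2018-10-27", "2018-10-28", "2018-10-29", "2018-09-21",
        "2018-08-24", "2018-08-01", "2018-07-25",
    ]),
    "User_ID": ["abc123"] * 4 + ["efg123"] * 3,
})


def first_gap(dates: pd.Series):
    """Time between the two earliest dates, or NaT if there are fewer than two."""
    earliest = dates.sort_values().iloc[:2]
    if len(earliest) < 2:
        return pd.NaT
    return earliest.iloc[1] - earliest.iloc[0]


# One gap per user, built by iterating over the groups.
gap_by_user = pd.Series({
    user: first_gap(dates)
    for user, dates in df.groupby("User_ID")["Date_Uploaded"]
})

print(gap_by_user.mean())  # 21 days 12:00:00
```

A user with a single upload gets NaT, and mean() skips NaT by default, so such users do not affect the average.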

Upvotes: 0

BENY

Reputation: 323276

Using sort_values + head to get each user's two earliest uploads:

df.sort_values('Date_Uploaded').groupby('User_ID').head(2)
Out[152]: 
  Date_Uploaded User_ID Display_Status
6    2018-07-25  efg123        Pending
5    2018-08-01  efg123        Pending
3    2018-09-21  abc123        Pending
0    2018-10-27  abc123        Cleared
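
The head(2) result gives the two earliest rows per user but not the time between them. One possible way to finish the job from there, sketched on the question's sample data (the names first_two, bounds and gap_per_user are illustrative, and the count filter is an added assumption for users with a single upload):

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Date_Uploaded": pd.to_datetime([
        "2018-10-27", "2018-10-28", "2018-10-29", "2018-09-21",
        "2018-08-24", "2018-08-01", "2018-07-25",
    ]),
    "User_ID": ["abc123"] * 4 + ["efg123"] * 3,
    "Display_Status": ["Cleared", "Cleared", "Pending", "Pending",
                       "Pending", "Pending", "Pending"],
})

# Two earliest uploads per user, as in the answer above.
first_two = df.sort_values("Date_Uploaded").groupby("User_ID").head(2)

# Earliest and second-earliest date per user, plus how many rows survived.
bounds = first_two.groupby("User_ID")["Date_Uploaded"].agg(["min", "max", "count"])

# Drop users with only one upload, where min and max would coincide.
bounds = bounds[bounds["count"] == 2]

gap_per_user = bounds["max"] - bounds["min"]
print(gap_per_user)         # abc123: 36 days, efg123: 7 days
print(gap_per_user.mean())  # 21 days 12:00:00
```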

Upvotes: 2
