How to obtain difference of a date column in groupby

Question

Currently my data looks like :

user_ID order_number    order_start_date    order_value week_day
237 135950  1594878.0   2018-01-01  534.0   Monday
235 32911   1594942.0   2018-01-01  89.0    Monday
232 208474  1594891.0   2018-01-01  85.0    Monday
231 9048    1594700.0   2018-01-01  224.0   Monday
228 134896  1594633.0   2018-01-01  449.0   Monday

What I want to achieve is groupby the records by user_ID and take difference of min and max value of each date and find out difference between them in days. Where I am struggling:

Groupby does not inherently supports minimum maximum difference
It is not possible to perform numerical operations such as mean() on datetime series which exist as a column in a dataframe. Though possible for individual series. Any help?

Tom · Accepted Answer

I feel like your description was practically the pseudocode!

output = df.groupby('user_ID')['order_start_date'].apply(lambda g: g.max()-g.min())

You can then get the difference in days as numbers (rather than timedeltas):

output = [i / pd.Timedelta(days=1) for i in output]

The output on your example data is all 0 because there is only one entry per user, this is what you expect yes?

As for taking the mean, you just need to represent the dates as seconds since some time and then take the average. I had tried to convert all to timedeltas since an old time and then average, but this post does it better and works well with groupby. Here's a test scenario where its all data for one userID and the dates go from Jan 1st to Jan 5th, 2020:

df.loc[:,'user_ID'] = 1111
df['order_start_date'] = pd.date_range('01-01-2020','01-05-2020',periods=5)
df['order_start_date'] = np.array(df['order_start_date'],dtype='datetime64[s]').view('i8')
output = df.groupby('user_ID')['order_start_date'].mean().astype('datetime64[s]')

Results:

user_ID
1111   2020-01-03

How to obtain difference of a date column in groupby

Answers (1)

Related Questions