Reputation: 6025
Currently my data looks like:
        user_ID  order_number order_start_date  order_value week_day
237      135950     1594878.0       2018-01-01        534.0   Monday
235       32911     1594942.0       2018-01-01         89.0   Monday
232      208474     1594891.0       2018-01-01         85.0   Monday
231        9048     1594700.0       2018-01-01        224.0   Monday
228      134896     1594633.0       2018-01-01        449.0   Monday
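For reference, the frame can be rebuilt roughly like this (the dtypes are my guesses from the printout):
import pandas as pd

df = pd.DataFrame({
    'user_ID': [135950, 32911, 208474, 9048, 134896],
    'order_number': [1594878.0, 1594942.0, 1594891.0, 1594700.0, 1594633.0],
    'order_start_date': pd.to_datetime(['2018-01-01'] * 5),
    'order_value': [534.0, 89.0, 85.0, 224.0, 449.0],
    'week_day': ['Monday'] * 5,
}, index=[237, 235, 232, 231, 228])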
What I want to achieve is to group the records by user_ID, take the min and max of order_start_date for each user, and find the difference between them in days. This is where I am struggling:
Upvotes: 0
Views: 38
Reputation: 8790
I feel like your description was practically the pseudocode!
output = df.groupby('user_ID')['order_start_date'].apply(lambda g: g.max()-g.min())
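Note this assumes order_start_date is already a datetime dtype; if it is still plain strings, convert it first (a minimal sketch reusing the same column name):
import pandas as pd

df['order_start_date'] = pd.to_datetime(df['order_start_date'])
output = df.groupby('user_ID')['order_start_date'].apply(lambda g: g.max() - g.min())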
You can then get the difference in days as numbers (rather than timedeltas):
output = [i / pd.Timedelta(days=1) for i in output]
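An equivalent vectorised alternative (assuming output is still the timedelta Series from the groupby, before the list comprehension above):
output_days = output / pd.Timedelta(days=1)   # float number of days
output_whole_days = output.dt.days            # integer whole days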
The output on your example data is all 0 because there is only one entry per user; is that what you expect?
As for taking the mean, you just need to represent the dates as seconds since some reference time and then take the average. I had tried converting everything to timedeltas since an old date and then averaging, but this post does it better and works well with groupby. Here's a test scenario where it's all data for one user_ID and the dates go from Jan 1st to Jan 5th, 2020:
import numpy as np
import pandas as pd

df.loc[:, 'user_ID'] = 1111  # reuse the 5-row sample frame, all rows for one user
df['order_start_date'] = pd.date_range('01-01-2020', '01-05-2020', periods=5)
# view the datetimes as integer seconds since the epoch so they can be averaged
df['order_start_date'] = np.array(df['order_start_date'], dtype='datetime64[s]').view('i8')
output = df.groupby('user_ID')['order_start_date'].mean().astype('datetime64[s]')
Results:
user_ID
1111 2020-01-03
Upvotes: 1