Reputation: 543
I have a pandas df (called df2) like this:
id | orderdate |
___________________
123|2020-11-01 |
123|2020-08-01 |
233|2020-07-01 |
233|2020-11-04 |
444|2020-11-04 |
444|2020-05-03 |
444|2020-04-01 |
444|2020-11-25 |
The values of orderdate are datetime with the format '%Y%m%d'. They represent orders of a client. I want to calculate the delta time between the first order and the second one for each id (each client).
I come up with:
for i in list(set(df2.id)):
list_sorted=list(set((df2.loc[df2['id']==i, 'orderdate'] )))
list_sorted= sorted(list_sorted) #get sorted list of the order dates in ascending order
min_list= list_sorted[0] # first element is first order
df2.loc[df2['id']==i, 'First Order']= min_list
if len(list_sorted)>1:
penultimate_list= list_sorted[1]
df2.loc[df2['id']==i, 'Second Order']= penultimate_list # second element is second order
df2.loc[df2['id']==i, 'Delta orders']= min_list - penultimate_list #calculate delta
else:
df2.loc[df2['id_user']==i, 'Delta orders']= None
My expected outcome is:
id | orderdate | First Order | Second Order| Delta Orders
______________________________________________
123|2020-11-01 |2020-08-01 | 2020-11-01 | 92 days
123|2020-08-01 |2020-08-01 | 2020-11-01 | 92 days
233|2020-07-01 |2020-07-01 | 2020-11-04 | 126 days
233|2020-11-04 |2020-07-01 | 2020-11-04 | 126 days
444|2020-11-04 |2020-04-01 | 2020-05-03 | 32 days
444|2020-05-03 |2020-04-01 | 2020-05-03 | 32 days
444|2020-04-01 |2020-04-01 | 2020-05-03 | 32 days
444|2020-11-25 |2020-04-01 | 2020-05-03 | 32 days
It works but I feel like it's cumbersome. Any easier way to do it?
Upvotes: 0
Views: 86
Reputation: 16581
Slightly different from what you want, but it's a start:
import pandas as pd
from io import StringIO
data = StringIO(
"""id|orderdate
123|2020-11-01
123|2020-08-01
233|2020-07-01
233|2020-11-04
444|2020-11-04
444|2020-05-03
444|2020-04-01
444|2020-11-25 """)
df = pd.read_csv(data, sep='|')
df['orderdate'] = pd.to_datetime(df['orderdate'], infer_datetime_format=True)
df = df.sort_values(['id', 'orderdate'], ascending=False)
def date_diff(df):
df['order_time_diff'] = (df['orderdate'] - df['orderdate'].shift(-1)).dt.days
df = df.dropna()
return df
# this calculates all order differences
df.groupby('id').apply(date_diff)
# this will get the data as requested
df.groupby('id', as_index=False).apply(date_diff).groupby('id').tail(1)
Upvotes: 1