Reputation: 143
I have a dataframe like the following:
id_cliente id_ordine data_ordine id_medium
0 madinside IML-0042758 2016-08-23 1190408
1 lisbeth19 IML-0071225 2017-02-26 1205650
2 lisbeth19 IML-0072944 2017-03-15 1207056
3 lisbeth19 IML-0077676 2017-05-12 1211395
4 lisbeth19 IML-0077676 2017-05-12 1207056
5 madinside IML-0094979 2017-09-29 1222195
6 lisbeth19 IML-0099675 2017-11-15 1211446
7 lisbeth19 IML-0099690 2017-11-15 1225212
8 lisbeth19 IML-0101439 2017-12-02 1226511
9 lisbeth19 IML-0109883 2018-03-14 1226511
I would like to add three columns:
the first column could be named "number of order per client" and should be the progression of orders made by the same client.
So order IML-0042758 should be 1, IML-0071225 should be 1, IML-0072944 should be 2, IML-0077676 should be 3, IML-0094979 should be 2, and so on..
the second column could be named "days between first and n order of the same client" and shows the the "data_ordine" difference (a datetime column) between the different orders made by the same client.
So the values for the first 6 rows would be: 0 (2016-08-23 - 2016-08-23), 0 (2017-02-26 - 2017-02-26), 17 (2017-03-15 - 2017-02-26), 75 (2017-05-12 - 2017-02-26), 75 (2017-05-12 - 2017-02-26), 402 (2017-09-29 - 2017-02-26).
the third column could be named "days between first and n order of the same id_medium" and shows the the "data_ordine" difference (a datetime column) between the different orders per id_medium.
So the values for the first 6 rows would be: 0 (2016-08-23 - 2016-08-23), 0 (2017-02-26 - 2017-02-26), 0 (2017-03-15 - 2017-03-15), 0 (2017-05-12 - 2017-05-12), 58 (2017-05-12 - 2017-03-15 because the medium "1207056" is ordered for the second time), 0 (2017-09-29 - 2017-09-29).
In the end I would like to calculate how long it takes in average for a client to make a second order, a third order, a fourth order and so on.
And how long it takes in average for a client to make a second, third (etc.) order for the same id_medium.
Upvotes: 4
Views: 761
Reputation: 59549
First convert to datetime
and sort
so the calculations are reliable.
groupby
+ ngroup
to label each order, then we subtract the min from each person so they all start from 1groupby
+ transform
to get the first date of each client then subtractid_medium
to the groupingdf['data_ordine'] = pd.to_datetime(df['data_ordine'])
df = df.sort_values('data_ordine')
df['Num_ords'] = df.groupby(['id_cliente', 'id_ordine']).ngroup()
df['Num_ords'] = df.Num_ords - df.groupby(['id_cliente']).Num_ords.transform('min')+1
df['days_bet'] = (df.data_ordine -df.groupby('id_cliente').data_ordine.transform('min')).dt.days
df['days_bet_id'] = (df.data_ordine - df.groupby(['id_cliente', 'id_medium']).data_ordine.transform('min')).dt.days
id_cliente id_ordine data_ordine id_medium Num_ords days_bet days_bet_id
0 madinside IML-0042758 2016-08-23 1190408 1 0 0
1 lisbeth19 IML-0071225 2017-02-26 1205650 1 0 0
2 lisbeth19 IML-0072944 2017-03-15 1207056 2 17 0
3 lisbeth19 IML-0077676 2017-05-12 1211395 3 75 0
4 lisbeth19 IML-0077676 2017-05-12 1207056 3 75 58
5 madinside IML-0094979 2017-09-29 1222195 2 402 0
6 lisbeth19 IML-0099675 2017-11-15 1211446 4 262 0
7 lisbeth19 IML-0099690 2017-11-15 1225212 5 262 0
8 lisbeth19 IML-0101439 2017-12-02 1226511 6 279 0
9 lisbeth19 IML-0109883 2018-03-14 1226511 7 381 102
Upvotes: 2