Reputation: 117
My data includes invoices and customers. One customer can have multiple invoices. One invoice belongs to always one customer. The invoices are updated daily (Report Date).
I want to calculate the mean interval (see column below) of the Posting Dates between all Invoices for each customer.
To explain it more simply: I want to know the mean interval (in days) of each customer between the posting date of his/her invoices he/she receives.
Report Date Invoice No Customer No Posting Date Mean-Interval (days)
2018-08-14 A 1 2018-08-05 3,5
2018-08-14 B 1 2018-08-10 3,5
2018-08-14 C 2 2018-08-01 15
2018-08-15 A 1 2018-08-05 3,5
2018-08-15 B 1 2018-08-10 3,5
2018-08-15 C 2 2018-08-01 15
2018-08-16 C 2 2018-08-01 15
2018-08-16 D 1 2018-08-12 3,5
2018-08-16 E 2 2018-08-16 15
So Customer 1 received Invoice A on 2018-08-05, Invoice B on 2018-08-10 and Invoice D on 2018-08-12.
The mean interval between all received Invoice dates is: (5+2)/2 = 3,5 Days
The code I wrote calculates the mean intervals between the distances of posting dates (grouped by "Invoice no").
But it does only for one specific Customer No "A". However, I need this to be calculated for each customer and assigned to the table as shown above.
Technically I could solve this with a for-loop. But with about 20k Customers this would take too long.
dateMean = df[df["Customer No"] == "A"].sort_values(by='Posting Date').groupby('Invoice No', sort=False)\
.agg({'Posting Date': 'first'})["Posting Date"]
dateMean.diff().mean()
Out: Timedelta('3 days 12:00:00')
Keep in mind, that it's possible that multiple Invoices can be created on the same date. This is why I always take the first date of an Invoice and compare it to an Invoice from another Posting Date.
Upvotes: 2
Views: 150
Reputation: 59549
sort
and drop_duplicates
to get the unique invoices. Then we can calculate the average time difference and map the result back to the original.
import pandas as pd
#df['Report Date'] = pd.to_datetime(df['Report Date'])
#df['Posting Date'] = pd.to_datetime(df['Posting Date'])
cols = ['Customer No', 'Invoice No']
df1 = df.sort_values(cols).drop_duplicates(cols)
# `m` so diff is only within user.
m = df1['Customer No'].eq(df1['Customer No'].shift())
# `s`: Series of average time diff, index is Customer No
s = df1['Posting Date'].diff().where(m).dt.days.groupby(df1['Customer No']).mean()
df['Mean-Interval (days)'] = df['Customer No'].map(s)
print(df)
Report Date Invoice No Customer No Posting Date Mean-Interval (days)
0 2018-08-14 A 1 2018-08-05 3.5
1 2018-08-14 B 1 2018-08-10 3.5
2 2018-08-14 C 2 2018-08-01 15.0
3 2018-08-15 A 1 2018-08-05 3.5
4 2018-08-15 B 1 2018-08-10 3.5
5 2018-08-15 C 2 2018-08-01 15.0
6 2018-08-16 C 2 2018-08-01 15.0
7 2018-08-16 D 1 2018-08-12 3.5
8 2018-08-16 E 2 2018-08-16 15.0
Upvotes: 2