Pandas: assign mean values to a category

Question

My data includes invoices and customers. One customer can have multiple invoices. One invoice belongs to always one customer. The invoices are updated daily (Report Date).

I want to calculate the mean interval (see column below) of the Posting Dates between all Invoices for each customer.

To explain it more simply: I want to know the mean interval (in days) of each customer between the posting date of his/her invoices he/she receives.

Report Date  Invoice No   Customer No  Posting Date   Mean-Interval (days)
2018-08-14   A            1            2018-08-05     3,5
2018-08-14   B            1            2018-08-10     3,5
2018-08-14   C            2            2018-08-01     15

2018-08-15   A            1            2018-08-05     3,5       
2018-08-15   B            1            2018-08-10     3,5       
2018-08-15   C            2            2018-08-01     15

2018-08-16   C            2            2018-08-01     15
2018-08-16   D            1            2018-08-12     3,5
2018-08-16   E            2            2018-08-16     15

So Customer 1 received Invoice A on 2018-08-05, Invoice B on 2018-08-10 and Invoice D on 2018-08-12.

The mean interval between all received Invoice dates is: (5+2)/2 = 3,5 Days

The code I wrote calculates the mean intervals between the distances of posting dates (grouped by "Invoice no").

But it does only for one specific Customer No "A". However, I need this to be calculated for each customer and assigned to the table as shown above.

Technically I could solve this with a for-loop. But with about 20k Customers this would take too long.

dateMean = df[df["Customer No"] == "A"].sort_values(by='Posting Date').groupby('Invoice No', sort=False)\
        .agg({'Posting Date': 'first'})["Posting Date"]

dateMean.diff().mean()

Out: Timedelta('3 days 12:00:00')

Keep in mind, that it's possible that multiple Invoices can be created on the same date. This is why I always take the first date of an Invoice and compare it to an Invoice from another Posting Date.

ALollz · Accepted Answer

sort and drop_duplicates to get the unique invoices. Then we can calculate the average time difference and map the result back to the original.

import pandas as pd

#df['Report Date'] = pd.to_datetime(df['Report Date'])
#df['Posting Date'] = pd.to_datetime(df['Posting Date'])

cols = ['Customer No', 'Invoice No']
df1 = df.sort_values(cols).drop_duplicates(cols)

# `m` so diff is only within user. 
m = df1['Customer No'].eq(df1['Customer No'].shift())

# `s`: Series of average time diff, index is Customer No
s = df1['Posting Date'].diff().where(m).dt.days.groupby(df1['Customer No']).mean()

df['Mean-Interval (days)'] = df['Customer No'].map(s)

print(df)

  Report Date Invoice No  Customer No Posting Date  Mean-Interval (days)
0  2018-08-14          A            1   2018-08-05                   3.5
1  2018-08-14          B            1   2018-08-10                   3.5
2  2018-08-14          C            2   2018-08-01                  15.0
3  2018-08-15          A            1   2018-08-05                   3.5
4  2018-08-15          B            1   2018-08-10                   3.5
5  2018-08-15          C            2   2018-08-01                  15.0
6  2018-08-16          C            2   2018-08-01                  15.0
7  2018-08-16          D            1   2018-08-12                   3.5
8  2018-08-16          E            2   2018-08-16                  15.0

Pandas: assign mean values to a category

Answers (1)

Related Questions