Youssef Razak
Youssef Razak

Reputation: 375

Computing average time between sales

I am working on a transactions dataset and I need to figure out the average time between purchases for each client.

I managed to get the diff between the latest date and the earliest (in months) and divide by the total purchases (NumPurchases). But I am not sure of this approach as it does not take into consideration the fact that not every only bought on multiple occasions.

Imagine the following dataset how would you extract the average time between purchases.

    CustomerID  EarliestSale    LatestSale      NumPurchases    
0   1             2017-01-05    2017-12-23                11    
1   10            2017-06-20    2017-11-17                 5    
2   100           2017-05-10    2017-12-19                 2    
3   1000          2017-02-19    2017-12-30                 9    
4   1001          2017-02-07    2017-11-18                 7    

Apologies for the rookie question in advance and thanks StackOverflow community :).

Upvotes: 1

Views: 1058

Answers (1)

robertwest
robertwest

Reputation: 942

Given your revised question and initial dataset (I've revised your dataset slightly to include two customers):

df = pd.DataFrame({'CustomerId': ['001', '001', '002', '002'], 
                   'SaleDate': ['2017-01-10', '2017-04-10', '2017-08-10', '2017-09-10'], 
                   'Quantity': [5, 1, 1, 6]})


You can easily include the average time between transactions (in days) in your group by with the following code:

NOTE: This will only work if you dataset is ordered by CustomerID then SaleDate.

import pandas as pd
df = pd.DataFrame({'CustomerId': ['001', '001', '002', '002'], 
                   'SaleDate': ['2017-01-10', '2017-04-10', '2017-08-10', '2017-09-10'], 
                   'Quantity': ['5', '1', '1', '6']})

# convert the string date to a datetime 
df['SaleDate'] = pd.to_datetime(df.SaleDate)

# sort the dataset
df = df.sort_values(['CustomerId', 'SaleDate'])

# calculate the difference between each date in days 
# (the .shift method will offset the rows, play around with this to understand how it works
# - We apply this to every customer using a groupby 
df2 = df.groupby("CustomerId").apply(
    lambda df: (df.SaleDate - df.SaleDate.shift(1)).dt.days).reset_index()
df2 = df2.rename(columns={'SaleDate': 'time-between-sales'})
df2.index = df2.level_1

# then join the result back on to the original dataframe
df = df.join(df2['time-between-sales'])

# add the mean time to your groupby 
grouped = df.groupby("CustomerId").agg({
    "SaleDate": ["min", "max"], 
    "Quantity": "sum", 
    "time-between-sales": "mean"})

# rename columns per your original specification 
grouped.columns = grouped.columns.get_level_values(0) + grouped.columns.get_level_values(1) 
grouped = grouped.rename(columns={
    'SaleDatemin': 'EarliestSale',
    'SaleDatemax': 'LatestSale',
    'Quantitysum': 'NumPurchases',
    'time-between-salesmean': 'avgTimeBetweenPurchases'})
print(grouped)

           EarliestSale LatestSale NumPurchases  avgTimeBetweenPurchases
CustomerId
001          2017-01-10 2017-04-10            6                     90.0
002          2017-08-10 2017-09-10            7                     31.0

Upvotes: 4

Related Questions