Reputation: 375
I am working on a transactions dataset and I need to figure out the average time between purchases for each client.
I managed to get the diff between the latest date and the earliest (in months) and divide by the total purchases (NumPurchases). But I am not sure of this approach as it does not take into consideration the fact that not every only bought on multiple occasions.
Imagine the following dataset how would you extract the average time between purchases.
CustomerID EarliestSale LatestSale NumPurchases
0 1 2017-01-05 2017-12-23 11
1 10 2017-06-20 2017-11-17 5
2 100 2017-05-10 2017-12-19 2
3 1000 2017-02-19 2017-12-30 9
4 1001 2017-02-07 2017-11-18 7
Apologies for the rookie question in advance and thanks StackOverflow community :).
Upvotes: 1
Views: 1058
Reputation: 942
Given your revised question and initial dataset (I've revised your dataset slightly to include two customers):
df = pd.DataFrame({'CustomerId': ['001', '001', '002', '002'],
'SaleDate': ['2017-01-10', '2017-04-10', '2017-08-10', '2017-09-10'],
'Quantity': [5, 1, 1, 6]})
You can easily include the average time between transactions (in days) in your group by with the following code:
NOTE: This will only work if you dataset is ordered by CustomerID then SaleDate.
import pandas as pd
df = pd.DataFrame({'CustomerId': ['001', '001', '002', '002'],
'SaleDate': ['2017-01-10', '2017-04-10', '2017-08-10', '2017-09-10'],
'Quantity': ['5', '1', '1', '6']})
# convert the string date to a datetime
df['SaleDate'] = pd.to_datetime(df.SaleDate)
# sort the dataset
df = df.sort_values(['CustomerId', 'SaleDate'])
# calculate the difference between each date in days
# (the .shift method will offset the rows, play around with this to understand how it works
# - We apply this to every customer using a groupby
df2 = df.groupby("CustomerId").apply(
lambda df: (df.SaleDate - df.SaleDate.shift(1)).dt.days).reset_index()
df2 = df2.rename(columns={'SaleDate': 'time-between-sales'})
df2.index = df2.level_1
# then join the result back on to the original dataframe
df = df.join(df2['time-between-sales'])
# add the mean time to your groupby
grouped = df.groupby("CustomerId").agg({
"SaleDate": ["min", "max"],
"Quantity": "sum",
"time-between-sales": "mean"})
# rename columns per your original specification
grouped.columns = grouped.columns.get_level_values(0) + grouped.columns.get_level_values(1)
grouped = grouped.rename(columns={
'SaleDatemin': 'EarliestSale',
'SaleDatemax': 'LatestSale',
'Quantitysum': 'NumPurchases',
'time-between-salesmean': 'avgTimeBetweenPurchases'})
print(grouped)
EarliestSale LatestSale NumPurchases avgTimeBetweenPurchases
CustomerId
001 2017-01-10 2017-04-10 6 90.0
002 2017-08-10 2017-09-10 7 31.0
Upvotes: 4