Reputation: 145
I want to create a graph that will display the cumulative average revenue for each 'Year Onboarded' (first customer transaction) over a period of time. But I am making mistakes when grouping the information I need.
Toy Data:
dataset = {'ClientId': [1,2,3,1,2,3,1,2,3,1,2,3,4,4,4,4,4,4,4],
'Year Onboarded': [2018,2019,2020,2018,2019,2020,2018,2019,2020,2018,2019,2020,2016,2016,2016,2016,2016,2016,2016],
'Year': [2019,2019,2020,2019,2019,2020,2018,2020,2020,2020,2019,2020,2016,2017,2018,2019,2020,2017,2018],
'Revenue': [100,50,25,30,40,50,60,100,20,40,100,20,5,5,8,4,10,20,8]}
df = pd.DataFrame(data=dataset)
Explanation: Customers have a designated 'Year Onboarded' and they make a transaction every 'Year' mentioned. Then I calculate the years that have elapsed since the clients onboarded in order to make my graph visually more appealing.
df['Yearsdiff'] = df['Year']-df['Year Onboarded']
To calculate the Cumulative Average Revenue I tried the following methods:
df = df.join(df.groupby(['Year']).expanding().agg({ 'Revenue': 'mean'})
.reset_index(level=0, drop=True)
.add_suffix('_roll'))
df.groupby(['Year Onboarded', 'Year']).last().drop(columns=['Revenue'])
The output starts to be cumulative but the last row isn't cumulative anymore (not sure why).
df.groupby(['Year Onboarded','Year']).agg('mean') \
.groupby(level=[1]) \
.agg({'Revenue':np.cumsum})
But it doesn't work properly, I tried other ways as well but didn't achieve good results.
To visualize the cumulative average revenue I simply use sns.lineplot
My goal is to get a graph similar as the one below but for that I first need to group my data correctly.
Expected output plot
The Years that we can see on the graph represent the 'Year Onboarded' not the 'Year'.
Can someone help me calculate a Cumulative Average Revenue that works in order to plot a graph similar to the one above? Thank you
Also the data provided in the toy dataset will surely not give something similar to the example plot but the idea should be there.
Upvotes: 3
Views: 1154
Reputation: 27869
You can create rolling mean like this:
df['rolling_mean'] = df.groupby(['Year Onboarded'])['Revenue'].apply(lambda x: x.rolling(10, 1).mean())
df
# ClientId Year Onboarded Year Revenue rolling_mean
# 0 1 2018 2019 100 100.000000
# 1 2 2019 2019 50 50.000000
# 2 3 2020 2020 25 25.000000
# 3 1 2018 2019 30 65.000000
# 4 2 2019 2019 40 45.000000
# 5 3 2020 2020 50 37.500000
# 6 1 2018 2018 60 63.333333
# 7 2 2019 2020 100 63.333333
# 8 3 2020 2020 20 31.666667
# 9 1 2018 2020 40 57.500000
# 10 2 2019 2019 100 72.500000
# 11 3 2020 2020 20 28.750000
# 12 4 2016 2016 5 5.000000
# 13 4 2016 2017 5 5.000000
# 14 4 2016 2018 8 6.000000
# 15 4 2016 2019 4 5.500000
# 16 4 2016 2020 10 6.400000
# 17 4 2016 2017 20 8.666667
# 18 4 2016 2018 8 8.571429
Upvotes: 0
Reputation: 18367
This is how I would do it and considering the toy data is not the same, probably some changes should be done, but all in all:
import seaborn as sns
df1 = df.copy()
df1['Yearsdiff'] = df1['Year']-df1['Year Onboarded']
df1['Revenue'] = df.groupby(['Year Onboarded'])['Revenue'].transform('mean')
#Find the average revenue per Year Onboarded
df1['Revenue'] = df1.groupby(['Yearsdiff'])['Revenue'].transform('cumsum')
#Calculate the cumulative sum of Revenue (Which is now the average per Year Onboarded) per Yearsdiff (because this will be our X-axis in the plot)
sns.lineplot(x=df1['Yearsdiff'],y=df1['Revenue'],hue=df1['Year'])
#Finally plot the data, using the column 'Year' as hue to account for the different years.
Upvotes: 0