Reputation: 559
I have a dataframe containing weekly sales for different products (a, b, c). If there were zero sales in a given week (e.g. week 4), there is no record for that week:
In[1]
df = pd.DataFrame({'product': list('aaaabbbbcccc'),
'week': [1, 2, 3, 5, 1, 2, 3, 5, 1, 2, 3, 4],
'sales': np.power(2, range(12))})
Out[1]
product sales week
0 a 1 1
1 a 2 2
2 a 4 3
3 a 8 5
4 b 16 1
5 b 32 2
6 b 64 3
7 b 128 5
8 c 256 1
9 c 512 2
10 c 1024 3
11 c 2048 4
I would like to create a new column containing the cumulative sales for the previous n weeks, grouped by product. E.g. for n=2 it should be like last_2_weeks:
product sales week last_2_weeks
0 a 1 1 0
1 a 2 2 1
2 a 4 3 3
3 a 8 5 4
4 b 16 1 0
5 b 32 2 16
6 b 64 3 48
7 b 128 5 64
8 c 256 1 0
9 c 512 2 256
10 c 1024 3 768
11 c 2048 4 1536
If there was a record for every week, I could just use rolling_sum
as described in this question.
Is there a way to set 'week' as an index and only calculate the sum on that index? Or could I resample 'week' and set sales to zero for all missing rows?
Upvotes: 4
Views: 923
Reputation: 139
You can use pivot to create a table which will auto-fill the missing values. This works provided that there is at least one entry for each week in your original data, reindex can be used to ensure that there is a row in the table for every week.
This can then have rolling_sum
applied to it:
import pandas, numpy
df = pandas.DataFrame({'product': list('aaaabbbbcccc'),
'week': [1, 2, 3, 5, 1, 2, 3, 5, 1, 2, 3, 4],
'sales': numpy.power(2, range(12))})
sales = df.pivot(index='week', columns='product')
# Cope with weeks when there were no sales at all
sales = sales.reindex(range(min(sales.index), 1+max(sales.index))).fillna(0)
# Calculate the sum for the preceding two weeks
pandas.rolling_sum(sales, 3, min_periods=1)-sales
This gives the following result, which looks to match the desired (in that it provides the sum for the preceding two weeks):
product a b c
week
1 0 0 0
2 1 16 256
3 3 48 768
4 6 96 1536
5 4 64 3072
Upvotes: 0
Reputation: 862511
Resample is only valid with DatetimeIndex
, TimedeltaIndex
or PeriodIndex
.
But reindex
is possible with integers.
Firstly column week
is set to index. Then df is grouped by column product
and apply reindex by max values of index of each group. Missing values are filled by 0
.
import pandas as pd
import numpy as np
df = pd.DataFrame({'product': list('aaaabbbbcccc'),
'week': [1, 2, 3, 5, 1, 2, 3, 5, 1, 2, 3, 4],
'sales': np.power(2, range(12))})
df = df.set_index('week')
def reindex_by_max_index_of_group(df):
index = range(1, max(df.index) + 1)
return df.reindex(index, fill_value=0)
df = df.groupby('product').apply(reindex_by_max_index_of_group)
df.drop(['product'], inplace=True, axis=1)
print df.reset_index()
# product week sales
#0 a 1 1
#1 a 2 2
#2 a 3 4
#3 a 4 0
#4 a 5 8
#5 b 1 16
#6 b 2 32
#7 b 3 64
#8 b 4 0
#9 b 5 128
#10 c 1 256
#11 c 2 512
#12 c 3 1024
#13 c 4 2048
Upvotes: 1