Reputation: 632
I have a dataframe as below:
data = pd.DataFrame({'Date':['20191001','20191002','20191003','20191004','20191005','20191006','20191001','20191002','20191003','20191004','20191005','20191006'],'Store':['A','A','A','A','A','A','B','B','B','B','B','B'],'Sale':[1,2,8,6,9,0,4,3,0,2,3,7]})
what I want to do is to calculate moving average for each store for 2 previous days (window size = 2) and put the value in a new column (let's say 'MA'), but the problem is that I want this window rolls over the actual sale and the previous calculated MA. the below image is the explanation:
Sorry that I had to articulate my issue with picture :|
I know I have to group by store and I can use the rolling(2) but the method would calculate the moving average on one column only.
My original window is 15 and the above is just an example.
Any help is kindly appreciated.
Upvotes: 1
Views: 280
Reputation: 2720
I cannot quite think of a way to solve this without writing custom code for the problem, as you are using data that you are generating the step before. The snippet below is what I came up with. It operates in linear time, which I believe is as good as you can get, mostly operates inplace, only requiring extra storage for a pd.Series of length window
, does minimal copying, only looking at each value once and it can be used with arbitrary window sizes, making it straightforward to extend to your real use case
def fill_ma(sales: pd.Series, window: int):
# "manually" do the first steps on the sales data
iter_data = sales.iloc[0:window]
for i in range(window):
iter_data.iloc[i] = np.mean(iter_data)
sales.iloc[0:window] = np.nan
sales.iloc[window:(2 * window)] = iter_data.values
# loop over the rest of the Series and compute the moving average of MA data
for i in range(2 * window, sales.shape[0]):
tmp = np.mean(iter_data)
iter_data.iloc[i % window] = tmp
sales.iloc[i] = tmp
return sales
Using this function is pretty simple: groupby
the Store column and apply
the function just like so:
window = 2
data.groupby('Store')['Sale'].apply(lambda x: fill_ma(x, window))
0 NaN
1 NaN
2 1.5000
3 1.7500
4 1.6250
5 1.6875
6 NaN
7 NaN
8 3.5000
9 3.2500
10 3.3750
11 3.3125
Name: Sale, dtype: float64
If you end up using this on significant amounts of real data, I'd be interested to hear how it performed regarding runtime. Cheers
Upvotes: 1