mpy
mpy

Reputation: 632

moving average on previous calculated moving average

I have a dataframe as below:

data = pd.DataFrame({'Date':['20191001','20191002','20191003','20191004','20191005','20191006','20191001','20191002','20191003','20191004','20191005','20191006'],'Store':['A','A','A','A','A','A','B','B','B','B','B','B'],'Sale':[1,2,8,6,9,0,4,3,0,2,3,7]})

data

what I want to do is to calculate moving average for each store for 2 previous days (window size = 2) and put the value in a new column (let's say 'MA'), but the problem is that I want this window rolls over the actual sale and the previous calculated MA. the below image is the explanation: calculations

Sorry that I had to articulate my issue with picture :|

I know I have to group by store and I can use the rolling(2) but the method would calculate the moving average on one column only.

My original window is 15 and the above is just an example.

Any help is kindly appreciated.

Upvotes: 1

Views: 280

Answers (1)

Lukas Thaler
Lukas Thaler

Reputation: 2720

I cannot quite think of a way to solve this without writing custom code for the problem, as you are using data that you are generating the step before. The snippet below is what I came up with. It operates in linear time, which I believe is as good as you can get, mostly operates inplace, only requiring extra storage for a pd.Series of length window, does minimal copying, only looking at each value once and it can be used with arbitrary window sizes, making it straightforward to extend to your real use case

def fill_ma(sales: pd.Series, window: int):
    # "manually" do the first steps on the sales data
    iter_data = sales.iloc[0:window]
    for i in range(window):
        iter_data.iloc[i] = np.mean(iter_data)

    sales.iloc[0:window] = np.nan
    sales.iloc[window:(2 * window)] = iter_data.values
    # loop over the rest of the Series and compute the moving average of MA data
    for i in range(2 * window, sales.shape[0]):
        tmp = np.mean(iter_data)
        iter_data.iloc[i % window] = tmp
        sales.iloc[i] = tmp

    return sales

Using this function is pretty simple: groupby the Store column and apply the function just like so:

window = 2
data.groupby('Store')['Sale'].apply(lambda x: fill_ma(x, window))

0        NaN
1        NaN
2     1.5000
3     1.7500
4     1.6250
5     1.6875
6        NaN
7        NaN
8     3.5000
9     3.2500
10    3.3750
11    3.3125
Name: Sale, dtype: float64

If you end up using this on significant amounts of real data, I'd be interested to hear how it performed regarding runtime. Cheers

Upvotes: 1

Related Questions