C4be

Reputation: 13

Calculate a rolling window weighted average on a Pandas column

I'm relatively new to Python, and have been trying to calculate some simple rolling weighted averages across rows in a pandas dataframe. I have a dataframe of observations, df, and a dataframe of weights, w. I create a new dataframe, dot, to hold the inner product of these two sets of values.

As w has fewer rows than df, I use a for loop to calculate, for each row, the weighted average of the rows leading up to it, taking as many rows as there are weights in w.

More clearly, my set-up is as follows:

import pandas as pd

df = pd.DataFrame([0,1,2,3,4,5,6,7,8], index = range(0,9))

w = pd.DataFrame([0.1,0.25,0.5], index = range(0,3))

dot = pd.DataFrame(0, columns = ['dot'], index = df.index)

for i in range(0,len(df)):
    dot.loc[i] = sum(df.iloc[max(1,(i-3)):i].values * w.iloc[-min(3,(i-1)):4].values)

I would expect the result to be as follows (e.g. when i = 4):

dot.loc[4] = sum(df.iloc[max(1,(4-3)):4].values * w.iloc[-min(3,(4-1)):4].values)
print(dot.loc[4])  # 2.1

However, when running the for loop above, I receive the error:

ValueError: operands could not be broadcast together with shapes (0,1) (2,1)

This is where I get confused. I think it must have to do with how I pass i into iloc, as I don't get shape errors when I calculate it manually, as in the example with 4 above. However, looking at other examples and the documentation, I don't see why that's the case. Any help is appreciated.

Upvotes: 1

Views: 2762

Answers (1)

Michoel Snow

Reputation: 373

Your first problem is that you are trying to multiply two arrays of different sizes. For example, when i=0 the two slices in your for loop have these shapes:

df.iloc[max(1,(0-3)):0].values.shape
# (0,1)

w.iloc[-min(3,(0-1)):4].values.shape    
# (2,1)

That mismatch is exactly the error you are getting. The easiest way I can think of to make the two arrays the same shape is to pad your dataframe with leading zeros, using concatenation:

df2 = pd.concat([pd.Series([0,0]),df], ignore_index=True)
df2
    0
0   0
1   0
2   0
3   1
4   2
5   3
6   4
7   5
8   6
9   7
10  8
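
As a quick sanity check (a sketch, not part of the original answer), the snippet below confirms that after padding, every three-row window of df2 has the same shape as w, so the element-wise product can no longer fail to broadcast. Building the padding as a DataFrame instead of a Series is an assumption made here to keep the column label explicit.

```python
import pandas as pd

df = pd.DataFrame([0, 1, 2, 3, 4, 5, 6, 7, 8])
w = pd.DataFrame([0.1, 0.25, 0.5])

# Pad with len(w) - 1 leading zeros so the very first window is already full.
pad = pd.DataFrame([0] * (len(w) - 1))
df2 = pd.concat([pad, df], ignore_index=True)

# Collect the shape of each len(w)-row window, one per original row of df.
window_shapes = [df2.iloc[i:i + len(w)].values.shape for i in range(len(df))]
print(window_shapes[0], w.values.shape)
```

Padding with len(w) - 1 zeros rather than a hard-coded 2 keeps the check valid if the number of weights changes.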

You can now use your for loop, with some minor tweaking:

for i in range(len(df)):
    # rows i..i+2 of the padded frame line up with the three weights
    dot.loc[i] = sum(df2.iloc[i:i+3].values * w.values)
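
To see what the loop produces, here is a self-contained sketch of the same idea. Two details are assumptions rather than part of the answer: dot is initialised with 0.0 so that it holds floats from the start, and the padding is built as a DataFrame so the column label is explicit.

```python
import pandas as pd

df = pd.DataFrame([0, 1, 2, 3, 4, 5, 6, 7, 8])
w = pd.DataFrame([0.1, 0.25, 0.5])
df2 = pd.concat([pd.DataFrame([0, 0]), df], ignore_index=True)

# Float-initialised so that assigning weighted sums does not change the dtype.
dot = pd.DataFrame(0.0, columns=["dot"], index=df.index)

for i in range(len(df)):
    window = df2.iloc[i:i + 3].values              # shape (3, 1), same as w.values
    dot.loc[i, "dot"] = (window * w.values).sum()  # scalar weighted sum

print(dot)
```

The values should agree with the rolling output further down once its two padding rows are dropped.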

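A nicer way might be the one JohnE suggested: use the rolling and apply functions built into pandas, thereby getting rid of your for loop.

import numpy as np
# raw=True passes each window as a plain array; ravel() flattens w so np.dot returns a scalar
df2.rolling(3,min_periods=3).apply(lambda x: np.dot(x,w.values.ravel()), raw=True)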
    0
0   NaN
1   NaN
2   0.00
3   0.50
4   1.25
5   2.10
6   2.95
7   3.80
8   4.65
9   5.50
10  6.35
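
For comparison with the rolling output above, the same weighted sums can be computed without rolling or apply. This is a different technique from the one in the answer: a sketch using np.correlate, with the variable names padded, weights and result chosen here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([0, 1, 2, 3, 4, 5, 6, 7, 8])
w = pd.DataFrame([0.1, 0.25, 0.5])

# Same zero padding as df2, done on a flat NumPy array.
padded = np.concatenate([np.zeros(len(w) - 1), df[0].to_numpy(dtype=float)])
weights = w[0].to_numpy()

# mode="valid" keeps only positions where the weights fully overlap the data,
# so result[j] is the dot product of weights with padded[j:j + 3].
result = pd.Series(np.correlate(padded, weights, mode="valid"), index=df.index)
print(result)
```

Because "valid" mode returns exactly one value per original row, there are no padding rows left to remove afterwards.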

You can also drop the first two padding rows from the rolling result and reset the index:

df2.rolling(3,min_periods=3).apply(lambda x: np.dot(x,w.values.ravel()), raw=True).drop([0,1]).reset_index(drop=True)
    0
0   0.00
1   0.50
2   1.25
3   2.10
4   2.95
5   3.80
6   4.65
7   5.50
8   6.35
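
Note that in this output each window includes the current row, so row 4 comes out as 2.95, whereas the question expected 2.1 at row 4 (the three rows before it). If that lagged definition is what is wanted, one possible sketch is to roll over the unpadded data and shift the result down by one row. The names weights, rolled and lagged are introduced here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([0, 1, 2, 3, 4, 5, 6, 7, 8])
weights = np.array([0.1, 0.25, 0.5])

# Weighted sum over each full 3-row window ending at the current row.
rolled = df[0].rolling(3).apply(lambda x: np.dot(x, weights), raw=True)

# Shift down one row so each value uses only the three rows before it.
lagged = rolled.shift(1)
print(lagged.loc[4])
```

With this variant the first three rows are NaN, since fewer than three earlier rows exist; the partial windows the question's loop attempted for small i are not reproduced.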

Upvotes: 3
