Bartli
Bartli

Reputation: 315

pandas moving average by group calculation is wrong

My dataframe priceDF begins like this:

    Date        symbol  Close
0   2000-01-03  HELN.SW 28.28
1   2000-01-04  HELN.SW 27.50
2   2000-01-05  HELN.SW 26.71
3   2000-01-06  HELN.SW 27.16
4   2000-01-07  HELN.SW 27.50

To calculate the moving average of the Closes, grouped by symbol I did:

priceDF['avg'] = priceDF.groupby('symbol')'Close'].rolling(3).mean().reset_index(drop=True)

and I get:

    Date        symbol  Close   avg
0   2000-01-03  HELN.SW 28.28   NaN
1   2000-01-04  HELN.SW 27.50   NaN
2   2000-01-05  HELN.SW 26.71   12.537398
3   2000-01-06  HELN.SW 27.16   12.022164
4   2000-01-07  HELN.SW 27.50   11.922733

In row 2, I want avg = 27.50, the average of the Closes in rows 0 to 2. What do I misunderstand?

Upvotes: 1

Views: 945

Answers (1)

piterbarg
piterbarg

Reputation: 8219

I am pretty sure this is an issue with how reset_index is used here,in conjuction with having multiple groups. Let us consider a slightly extended example:

from io import StringIO

data = StringIO(
'''
   Date        symbol  Close
0   2000-01-03  HELN.SW 28.28
1   2000-01-04  HELN.SW 27.50
2   2000-01-05  HELN.SW 26.71
3   2000-01-06  HELN.SW 27.16
4   2000-01-07  HELN.SW 27.50
5   2000-01-07  AAAA.SW 30.00
''')
df = pd.read_csv(data, sep = '\s+', index_col=0)

(we have added AAAA.SW to the list)

Now this:

df.groupby('symbol')['Close'].rolling(3).mean()

produces sensible numbers (as there is only one date for AAAA we expect NaN:


symbol    
AAAA.SW  5          NaN
HELN.SW  0          NaN
         1          NaN
         2    27.496667
         3    27.123333
         4    27.123333
Name: Close, dtype: float64

but this:

df.groupby('symbol')['Close'].rolling(3).mean().reset_index(drop=True)

produces wrong indices

0          NaN
1          NaN
2          NaN
3    27.496667
4    27.123333
5    27.123333
Name: Close, dtype: float64

and when put into the original df end up in the wrong rows:

df['avg'] = df.groupby('symbol')['Close'].rolling(3).mean().reset_index(drop=True)
df

produces

     Date       symbol  Close   avg
0   2000-01-03  HELN.SW 28.28   NaN
1   2000-01-04  HELN.SW 27.50   NaN
2   2000-01-05  HELN.SW 26.71   NaN
3   2000-01-06  HELN.SW 27.16   27.496667
4   2000-01-07  HELN.SW 27.50   27.123333
5   2000-01-07  AAAA.SW 30.00   27.123333

a solution is to do the aqssignments within each group, like this:

df.groupby('symbol').apply(lambda g: g.assign(avg = g['Close'].rolling(3).mean())).reset_index(drop=True)

so we get


    Date        symbol  Close   avg
0   2000-01-07  AAAA.SW 30.00   NaN
1   2000-01-03  HELN.SW 28.28   NaN
2   2000-01-04  HELN.SW 27.50   NaN
3   2000-01-05  HELN.SW 26.71   27.496667
4   2000-01-06  HELN.SW 27.16   27.123333
5   2000-01-07  HELN.SW 27.50   27.123333

Upvotes: 1

Related Questions