DataFrame: Moving average with rolling, mean and shift while ignoring NaN

I have a data set, let's say, 420x1. Now I would to calculate the moving average of the past 30 days, excluding the current date.

If I do the following:

df.rolling(window = 30).mean().shift(1)

my df results in a window with lots of NaNs, which is probably caused by NaNs in the original dataframe here and there (1 NaN within the 30 data points results the MA to be NaN).

Is there a method that ignores NaN (avoiding apply-method, I run it on large data so performance is key)? I do not want to replace the value with 0 because that could skew the results.

the same applies than to moving standard deviation.

Upvotes: 3

Answers (3)

BENY

Reputation: 323276

For example you can adding min_periods, and NaN is gone

df=pd.DataFrame({'A':[1,2,3,np.nan,2,3,4,np.nan]})
df.A.rolling(window=2,min_periods=1).mean()

Out[7]: 
0    1.0
1    1.5
2    2.5
3    3.0
4    2.0
5    2.5
6    3.5
7    4.0
Name: A, dtype: float64

Upvotes: 7

piRSquared

Reputation: 294288

Option 1

df.dropna().rolling('30D').mean()

Option 2

df.interpolate('index').rolling('30D').mean()

Option 2.5

df.interpolate('index').rolling(30).mean()

Option 3

s.rolling('30D').apply(np.nanmean)

Option 3.5

df.rolling(30).apply(np.nanmean)

Upvotes: 3

R.yan

Reputation: 2372

You can try dropna() to remove the nan values or fillna() to replace the nan with specific value.

Or you can filter out all nan value by notnull() or isnull() within your operation.

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],columns=['one', 'two', 'three'])
df2 = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print df2
        one       two     three
a  0.434024 -0.749472 -1.393307
b       NaN       NaN       NaN
c  0.897861  0.032307 -0.602912
d       NaN       NaN       NaN
e -1.056938 -0.129128  1.328862
f -0.581842 -0.682375 -0.409072
g       NaN       NaN       NaN
h -1.772906 -1.342019 -0.948151

df3 = df2[df2['one'].notnull()]
# use ~isnull() would return the same result
# df3 = df2[~df2['one'].isnull()]
print df3
    one       two     three
a  0.434024 -0.749472 -1.393307
c  0.897861  0.032307 -0.602912
e -1.056938 -0.129128  1.328862
f -0.581842 -0.682375 -0.409072
h -1.772906 -1.342019 -0.948151

For further reference, Pandas has a clean documentary about handling missing data(read this).

Upvotes: 2