Reputation: 282
I'm performing an upgrade from our current environment (Python 2.7.3 64-bit, pandas 0.9) to a new one (Python 2.7.6, pandas 0.14.1), and some of my regression tests are failing. I tracked the failures down to the behavior of pandas.stats.moments.rolling_mean.
Here is a sample to reproduce the error:
import pandas as pd
data = [1.0,
        0.99997000000000003,
        0.99992625131299995,
        0.99992500140499996,
        0.99986125618599997,
        0.99981126312299995,
        0.99976377208800005,
        0.99984375318999996]
ser = pd.Series(data, index=pd.date_range('2008-05-28', '2008-06-06', freq='B'))
print "rolling mean: %.17f" % pd.stats.moments.rolling_mean(ser, window=5, min_periods=1)['2008-06-06']
print "sum divide: %.17f" % (ser['2008-6-1':'2008-6-6'].sum()/5)
In my original environment, I get the following output:
rolling mean: 0.99984100919839991
sum divide: 0.99984100919839991
but in my new environment the output is now:
rolling mean: 0.99984100919840002
sum divide: 0.99984100919839991
As you can see, the rolling mean now gives a slightly different number. The difference is small on its own, but such errors compound downstream and end up being non-trivial.
Does anyone know what could be causing this, or whether there's a workaround?
Upvotes: 3
Views: 900
Reputation: 9614
The difference between the two approaches comes from accumulated rounding error, which is greater in the sum divide computation. In the past, the rolling mean computation suffered from a similar issue, but internal improvements to its algorithm over the past few versions have brought it to a more precise result.
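To see where such last-bit differences come from, here is a minimal sketch of my own (not code from either pandas version) using the five values that the window=5 rolling mean averages at 2008-06-06: a naive left-to-right float64 accumulation rounds after every addition, while math.fsum tracks the partial sums exactly and rounds only once at the end.

import math

# The five values averaged by the window=5 rolling mean at 2008-06-06.
window = [0.99992500140499996,
          0.99986125618599997,
          0.99981126312299995,
          0.99976377208800005,
          0.99984375318999996]

naive = 0.0
for x in window:
    naive += x              # rounds to the nearest float64 after each addition

exact = math.fsum(window)   # correctly-rounded float64 sum of the window

# The two sums (and hence the two means) can differ in the last bit.
print "naive mean: %.17f" % (naive / 5)
print "fsum mean:  %.17f" % (exact / 5)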
First of all, let's establish that the new rolling mean result is more precise. We can show this by invoking the sum divide approach twice, each time with a different precision:
In [165]: import numpy as np

In [166]: ser1 = pd.Series(data, index=pd.date_range('2008-05-28', '2008-06-06', freq='B'))
In [167]: type(ser1[0])
Out[167]: numpy.float64
In [168]: print "sum divide: %.17f" % (ser1['2008-6-1':'2008-6-6'].sum()/5)
sum divide: 0.99984100919839991
In [169]: ser2 = pd.Series(data, index=pd.date_range('2008-05-28', '2008-06-06', freq='B'), dtype=np.float128)
In [170]: print "sum divide: %.17f" % (ser2['2008-6-1':'2008-6-6'].sum()/5)
sum divide: 0.99984100919840002
Using the greater np.float128 precision results in a value closer to that of the new rolling mean version, which strongly suggests that the new rolling mean version is more precise than the previous one.
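As an aside, the size of the precision gap can be quantified with NumPy's finfo; note that the width of np.float128 is platform-dependent (on many x86 builds it is 80-bit extended precision rather than true quadruple precision):

import numpy as np

# Machine epsilon: the gap between 1.0 and the next representable value.
print "float64 eps:  %s" % np.finfo(np.float64).eps    # ~2.22e-16
print "float128 eps: %s" % np.finfo(np.float128).eps   # typically ~1.08e-19 on x86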
This also suggests a possible workaround for your problem: employ greater precision in your calculations by defining your series to hold np.float128 objects. This improves the precision of the sum divide approach, but doesn't affect that of the rolling mean approach:
In [185]: pd.stats.moments.rolling_mean(ser1, window=5, min_periods=1) == pd.stats.moments.rolling_mean(ser2, window=5, min_periods=1)
Out[185]:
2008-05-28 True
2008-05-29 True
2008-05-30 True
2008-06-02 True
2008-06-03 True
2008-06-04 True
2008-06-05 True
2008-06-06 True
Freq: B, dtype: bool
Note that even though this brings the results of the two approaches closer together, and they even appear identical:
In [194]: print "sum divide: %.60f" % (ser2['2008-6-1':'2008-6-6'].sum()/5)
sum divide: 0.999841009198400021418251526483800262212753295898437500000000
In [195]: print "rolling mean: %.60f" % pd.stats.moments.rolling_mean(ser2, window=5, min_periods=1)['2008-06-06']
rolling mean: 0.999841009198400021418251526483800262212753295898437500000000
from the processor's point of view, they still differ:
In [196]: pd.stats.moments.rolling_mean(ser2, window=5, min_periods=1)['2008-06-06'] == ser2['2008-6-1':'2008-6-6'].sum()/5
Out[196]: False
In [197]: pd.stats.moments.rolling_mean(ser2, window=5, min_periods=1)['2008-06-06'] - ser2['2008-6-1':'2008-6-6'].sum()/5
Out[197]: 4.4398078963281406573e-17
but hopefully the margin of error, which is a bit smaller now, falls within your use case's tolerance.
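If bit-for-bit equality is what trips the regression tests, a standard floating-point practice (beyond what's discussed above, so treat it as a general suggestion rather than a pandas-specific fix) is to compare results with a tolerance instead of exact equality, for example via numpy.allclose; the tolerances below are illustrative and should be tuned to your accuracy requirements:

import numpy as np

old_result = 0.99984100919839991   # rolling mean from the original environment
new_result = 0.99984100919840002   # rolling mean from the upgraded environment

# Tolerance-based comparison instead of bit-for-bit equality.
print np.allclose(old_result, new_result, rtol=1e-12, atol=0.0)   # True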
Upvotes: 4