Reputation: 282
I'm performing an upgrade from our current environment (Python 2.7.3 64-bit, pandas 0.9) to a new one (Python 2.7.6, pandas 0.14.1), and some of my regression tests are failing. I tracked the failures down to the behavior of pandas.stats.moments.rolling_mean.
Here is a sample to reproduce the error:
import pandas as pd
data = [1.0,
        0.99997000000000003,
        0.99992625131299995,
        0.99992500140499996,
        0.99986125618599997,
        0.99981126312299995,
        0.99976377208800005,
        0.99984375318999996]
ser = pd.Series(data, index=pd.date_range('2008-05-28', '2008-06-06', freq='B'))
print "rolling mean: %.17f" % pd.stats.moments.rolling_mean(ser, window=5, min_periods=1)['2008-06-06']
print "sum divide: %.17f" % (ser['2008-6-1':'2008-6-6'].sum()/5)
In my original environment, I get the following output:
rolling mean: 0.99984100919839991
sum divide: 0.99984100919839991
but in my new environment the output is now:
rolling mean: 0.99984100919840002
sum divide: 0.99984100919839991
As you can see, the rolling mean now gives a slightly different number. The difference is small on its own, but such errors compound downstream and end up being non-trivial.
Does anyone know what could be causing this, or whether there's a workaround?
Upvotes: 3
Views: 900
Reputation: 9614
The difference between the two approaches comes from accumulated rounding error, which is greater in the sum divide computation. In the past, the rolling mean computation suffered from a similar issue, but internal improvements to its algorithm over the past few versions have brought it to a more precise result.
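To see where such last-bit differences come from, here is a minimal sketch of my own (not code from either pandas version) using the five values that the window=5 rolling mean averages at 2008-06-06: a naive left-to-right float64 accumulation rounds after every addition, while math.fsum tracks the partial sums exactly and rounds only once at the end.

import math

# The five values averaged by the window=5 rolling mean at 2008-06-06.
window = [0.99992500140499996,
          0.99986125618599997,
          0.99981126312299995,
          0.99976377208800005,
          0.99984375318999996]

naive = 0.0
for x in window:
    naive += x              # rounds to the nearest float64 after each addition

exact = math.fsum(window)   # correctly-rounded float64 sum of the window

# The two sums (and hence the two means) can differ in the last bit.
print "naive mean: %.17f" % (naive / 5)
print "fsum mean:  %.17f" % (exact / 5)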
First of all, let's establish that the new rolling mean result is more precise. We can show this by invoking the sum divide approach twice, each time with a different precision:
In [165]: import numpy as np

In [166]: ser1 = pd.Series(data, index=pd.date_range('2008-05-28', '2008-06-06', freq='B'))
In [167]: type(ser1[0])
Out[167]: numpy.float64
In [168]: print "sum divide: %.17f" % (ser1['2008-6-1':'2008-6-6'].sum()/5)
sum divide: 0.99984100919839991
In [169]: ser2 = pd.Series(data, index=pd.date_range('2008-05-28', '2008-06-06', freq='B'), dtype=np.float128)
In [170]: print "sum divide: %.17f" % (ser2['2008-6-1':'2008-6-6'].sum()/5)
sum divide: 0.99984100919840002
Using the greater np.float128 precision results in a value closer to that of the new rolling mean version, which strongly suggests that the new rolling mean version is more precise than the previous one.
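As an aside, the size of the precision gap can be quantified with NumPy's finfo; note that the width of np.float128 is platform-dependent (on many x86 builds it is 80-bit extended precision rather than true quadruple precision):

import numpy as np

# Machine epsilon: the gap between 1.0 and the next representable value.
print "float64 eps:  %s" % np.finfo(np.float64).eps    # ~2.22e-16
print "float128 eps: %s" % np.finfo(np.float128).eps   # typically ~1.08e-19 on x86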
This also suggests a possible workaround for your problem: employ greater precision in your calculations by defining your series to hold np.float128 objects. This improves the precision of the sum divide approach, but doesn't affect that of the rolling mean approach:
In [185]: pd.stats.moments.rolling_mean(ser1, window=5, min_periods=1) == pd.stats.moments.rolling_mean(ser2, window=5, min_periods=1)
Out[185]:
2008-05-28 True
2008-05-29 True
2008-05-30 True
2008-06-02 True
2008-06-03 True
2008-06-04 True
2008-06-05 True
2008-06-06 True
Freq: B, dtype: bool
Note that even though this brings the results of the two approaches closer together, and they even appear identical:
In [194]: print "sum divide: %.60f" % (ser2['2008-6-1':'2008-6-6'].sum()/5)
sum divide: 0.999841009198400021418251526483800262212753295898437500000000
In [195]: print "rolling mean: %.60f" % pd.stats.moments.rolling_mean(ser2, window=5, min_periods=1)['2008-06-06']
rolling mean: 0.999841009198400021418251526483800262212753295898437500000000
from the processor's point of view, they still differ:
In [196]: pd.stats.moments.rolling_mean(ser2, window=5, min_periods=1)['2008-06-06'] == ser2['2008-6-1':'2008-6-6'].sum()/5
Out[196]: False
In [197]: pd.stats.moments.rolling_mean(ser2, window=5, min_periods=1)['2008-06-06'] - ser2['2008-6-1':'2008-6-6'].sum()/5
Out[197]: 4.4398078963281406573e-17
but hopefully the margin of error, which is a bit smaller now, falls within your use case's tolerance.
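If bit-for-bit equality is what trips the regression tests, a standard floating-point practice (beyond what's discussed above, so treat it as a general suggestion rather than a pandas-specific fix) is to compare results with a tolerance instead of exact equality, for example via numpy.allclose; the tolerances below are illustrative and should be tuned to your accuracy requirements:

import numpy as np

old_result = 0.99984100919839991   # rolling mean from the original environment
new_result = 0.99984100919840002   # rolling mean from the upgraded environment

# Tolerance-based comparison instead of bit-for-bit equality.
print np.allclose(old_result, new_result, rtol=1e-12, atol=0.0)   # True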
Upvotes: 4