Reputation: 6480
>>> import numpy as np
>>> from scipy import stats
>>> a = np.r_[1., 2., np.nan, 4., 5.]
>>> stats.nanmean(a)
2.9999999999999996
>>> np.nansum(a)/np.sum(~np.isnan(a))
3.0
I'm aware of the limitation of floating point representation. Just curious why the more clumsy expression seems to give "better" result.
Upvotes: 4
Views: 3272
Reputation: 500733
First of all, here is scipy.nanmean() so that we know what we're comparing to:
def nanmean(x, axis=0):
    x, axis = _chk_asarray(x, axis)
    x = x.copy()
    Norig = x.shape[axis]
    factor = 1.0 - np.sum(np.isnan(x), axis) * 1.0 / Norig
    x[np.isnan(x)] = 0
    return np.mean(x, axis) / factor
Mathematically, the two methods are equivalent. Numerically, they are different.
Your method involves a single division, and it so happens that the numerator (1. + 2. + 4. + 5. = 12.0) can be represented exactly as a float, and the denominator (4.) is a power of two. This means that the result of the division is exact: 3.0.
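Both facts can be checked directly; a quick sketch using the array from the question:

```python
import numpy as np

a = np.array([1., 2., np.nan, 4., 5.])

numerator = np.nansum(a)            # 1 + 2 + 4 + 5 = 12.0, exactly representable
denominator = np.sum(~np.isnan(a))  # 4 non-NaN values, a power of two

# dividing by a power of two only shifts the exponent,
# so the quotient is computed exactly
result = numerator / denominator

print(result == 3.0)   # True
```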
stats.nanmean() involves first computing the mean of [1., 2., 0., 4., 5.], and then adjusting it to account for NaNs. As it happens, this mean (2.4) cannot be represented exactly as a float, so from this point on the computation is inexact.
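The rounding can be reproduced step by step (a sketch of the same arithmetic, not the library code itself):

```python
import numpy as np

a = np.array([1., 2., np.nan, 4., 5.])

# step 1: NaNs replaced by zero, mean over the full length
mean_with_zeros = np.mean(np.nan_to_num(a))   # 12.0 / 5 = 2.4, already inexact

# step 2: correction factor for the missing values
factor = 1.0 - np.sum(np.isnan(a)) * 1.0 / len(a)   # 0.8, also inexact

# two inexact intermediates, so the final division rounds away from 3.0
print(mean_with_zeros / factor)   # 2.9999999999999996
```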
I haven't given it a lot of thought, but it may be possible to construct an example where the roles would be reversed, and stats.nanmean()
would give a more accurate result than the other method.
What surprises me is that stats.nanmean()
doesn't simply do something like:
In [6]: np.mean(np.ma.MaskedArray(a, np.isnan(a)))
Out[6]: 3.0
This seems to me to be a superior approach to what it does currently.
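For what it's worth, the masked-array route sums only the valid entries and divides by their count, so it performs the same single exact division as the hand-written expression; a small sketch:

```python
import numpy as np

a = np.array([1., 2., np.nan, 4., 5.])

# mask the NaNs, then take the mean of the remaining entries:
# a single sum (12.0) divided by a single count (4)
m = np.ma.masked_invalid(a).mean()

print(float(m))   # 3.0
```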
Upvotes: 8
Reputation: 63767
As @eumiro mentioned, stats.nanmean calculates the mean in a roundabout way, different from the straightforward one-liner you used.
From the same reference code, np.sum(np.isnan(x), axis) returns a numpy.int32, which, when multiplied by 1.0, yields a floating-point approximation instead of the value an integer computation would have given, resulting in a difference in the result:
>>> numpy.int32(1)*1.0/5
0.20000000000000001
>>> int(numpy.int32(1))*1.0/5
0.2
>>> type(np.sum(np.isnan(x),axis))
<type 'numpy.int32'>
Upvotes: 1
Reputation: 213015
The answer is in the code of stats.nanmean:
x, axis = _chk_asarray(x, axis)
x = x.copy()
Norig = x.shape[axis]
factor = 1.0 - np.sum(np.isnan(x), axis) * 1.0 / Norig
x[np.isnan(x)] = 0
return np.mean(x, axis) / factor
I believe it has something to do with the 1.0 - np.sum term, a subtraction involving the sum.
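Indeed, the correction factor computed from that subtraction is one place where inexactness enters; assuming the array from the question, a quick check:

```python
import numpy as np

a = np.array([1., 2., np.nan, 4., 5.])

# factor = 1 - (number of NaNs)/(total length) = 1 - 1/5
factor = 1.0 - np.sum(np.isnan(a)) * 1.0 / a.shape[0]

# 0.8 has no exact binary representation, so dividing by it
# reintroduces rounding error that the single-division method avoids
print(2.4 / factor == 3.0)   # False
```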
Upvotes: 2