Reputation: 30424
I'm wondering whether this is a bug, or whether I just don't understand how np.nanmean is supposed to work with a DataFrame. It works if I convert each DataFrame to an array first, but not when I pass the DataFrames directly, and no exception is raised either. I originally noticed this here: Fill data gaps with average of data from adjacent days
import numpy as np
from pandas import DataFrame
df1 = DataFrame({'x': [1, 3, np.nan]})
df2 = DataFrame({'x': [2, np.nan, 5]})
     x
0    1
1    3
2  NaN

     x
0    2
1  NaN
2    5
In [1503]: np.nanmean([df1, df2], axis=0)
Out[1503]:
     x
0  1.5
1  NaN
2  NaN
In [1504]: np.nanmean([df1.values, df2.values], axis=0)
Out[1504]:
array([[ 1.5],
       [ 3. ],
       [ 5. ]])
Reputation: 5045
This is definitely strange behavior. I don't have a full answer, but the root of it seems to be that entire pandas DataFrames can be stored as elements of a numpy array, which leads to odd results. I'd avoid that as much as possible, and I'm not sure why DataFrames are valid numpy elements at all.
np.nanmean probably converts its argument into an np.array before applying any operations, so let's look at what that conversion produces:
a = np.array([df1, df2])
First, note that this is not a 3-d array as you might expect. It is actually a 1-d array in which each element is a DataFrame.
print(a.shape)
# (2,)
print(type(a[0]))
# <class 'pandas.core.frame.DataFrame'>
So nanmean is taking the mean of the two DataFrames themselves, not of the values inside them. This also means that the axis argument isn't really doing anything useful: axis=0 is the only axis available, and if you try axis=1 you'll get an error because the array is 1-d.
np.nanmean(a, axis=1)
# IndexError: tuple index out of range
print(np.nanmean(a))
# x
# 0 1.5
# 1 NaN
# 2 NaN
That's why you get a different answer than when you build the array from .values. With .values, numpy properly creates a 3-d array of numbers rather than the weird 1-d array of DataFrames.
b = np.array([df1.values, df2.values ])
print(b.shape)
# (2, 3, 1)
print(type(b[1]))
# <class 'numpy.ndarray'>
print(type(b[0,0,0]))
# <class 'numpy.float64'>
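A minimal sketch of the workaround described above, assuming a reasonably recent NumPy and pandas (to_numpy needs pandas 0.24 or later). It converts each frame to a plain float array, stacks them explicitly, and wraps the result back in a DataFrame, so it does not depend on how np.array treats a list of DataFrames in any particular version. The names stacked, mean_values and result are illustrative only.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'x': [1, 3, np.nan]})
df2 = pd.DataFrame({'x': [2, np.nan, 5]})

# Stack the underlying float arrays explicitly so the result is a
# numeric 3-d array, never an object array of DataFrames.
stacked = np.stack([df1.to_numpy(dtype=float), df2.to_numpy(dtype=float)])
print(stacked.shape, stacked.dtype)

# Average across the two frames (axis 0), ignoring NaN entries.
mean_values = np.nanmean(stacked, axis=0)

# Put the original row and column labels back on the result.
result = pd.DataFrame(mean_values, index=df1.index, columns=df1.columns)
print(result)
```

Checking stacked.shape and stacked.dtype before calling nanmean is a cheap way to confirm you really have an array of numbers rather than an array of objects.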
These arrays of DataFrames have some especially weird behavior, though. Say we make a length-3 array whose third element is np.nan. You might expect nanmean to give the same answer as it did with a above, since it should exclude the nan value, right?
print(np.nanmean(np.array([df1, df2, np.nan])))
# x
# 0 NaN
# 1 NaN
# 2 NaN
Yeah, so I'm not sure what is going on there. Best to avoid making arrays of DataFrames in the first place.
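If the goal is just an element-wise mean across several frames that skips NaN, one way to sidestep numpy entirely is to stay in pandas. This is a sketch of one possible alternative, not something taken from the question; it assumes the frames share the same row labels and columns, and the names combined and averaged are made up for illustration.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'x': [1, 3, np.nan]})
df2 = pd.DataFrame({'x': [2, np.nan, 5]})

# Stack the frames vertically; each row label (0, 1, 2) now appears twice.
combined = pd.concat([df1, df2])

# Group rows that share a label and average them.
# pandas skips NaN in groupby means by default.
averaged = combined.groupby(level=0).mean()
print(averaged)
```

Because the grouping is done on the index, this also lines rows up by label rather than by position, which may or may not be what you want if the frames are ordered differently.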