JohnE

Reputation: 30424

Numpy nanmean and dataframe (possible bug?)

I'm wondering whether this is a bug, or whether I misunderstand how nanmean should work with a DataFrame. It seems to work if I convert the DataFrames to arrays, but not directly on the DataFrames, and no exception is raised either. Originally noticed here: Fill data gaps with average of data from adjacent days

import numpy as np
from pandas import DataFrame
df1 = DataFrame({ 'x': [1,3,np.nan] })
df2 = DataFrame({ 'x': [2,np.nan,5] })

    x
0   1
1   3
2 NaN

    x
0   2
1 NaN
2   5

In [1503]: np.nanmean( [df1,df2], axis=0 )
Out[1503]: 
     x
0  1.5
1  NaN
2  NaN

In [1504]: np.nanmean( [df1.values, df2.values ], axis=0 )
Out[1504]: 
array([[ 1.5],
       [ 3. ],
       [ 5. ]])

Upvotes: 1

Views: 2363

Answers (1)

Roger Fan

Reputation: 5045

It's definitely strange behavior. I don't have a full explanation, but the root cause seems to be that whole pandas DataFrames can be elements of a numpy array, which leads to surprising results. I'd avoid this as much as possible, and I'm not sure why DataFrames are valid numpy elements at all.

np.nanmean probably converts its argument into an np.array before applying any operations. So let's look at

a = np.array([df1, df2])

First, note that this is not a 3-d array as you might expect. It's actually a 1-d array in which each element is a DataFrame.

print(a.shape)
# (2,)

print(type(a[0]))
# <class 'pandas.core.frame.DataFrame'>

So nanmean is averaging the two DataFrame objects themselves, not the values inside them. The output is consistent with (df1 + df2) / 2, where any NaN in either frame propagates into the result. This also means that axis=0 only refers to the single axis of that 1-d array, and if you try axis=1 you'll get an error because there is no second axis.

np.nanmean(a, axis=1)
# IndexError: tuple index out of range

print(np.nanmean(a))
#      x
# 0  1.5
# 1  NaN
# 2  NaN

That's why you're getting a different answer than when you create the array with values. When you use values, numpy properly creates the 3-d array of numbers, rather than the weird 1-d array of DataFrames.

b = np.array([df1.values, df2.values ])

print(b.shape)
# (2, 3, 1)

print(type(b[1]))
# <class 'numpy.ndarray'>

print(type(b[0,0,0]))
# <class 'numpy.float64'>
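
If you want the result back as a DataFrame, one way is to stack the values explicitly and re-wrap the output. This is just a sketch, and it assumes both frames share the same index and columns. The names stacked and result are mine, not from the question.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'x': [1, 3, np.nan]})
df2 = pd.DataFrame({'x': [2, np.nan, 5]})

# Stack the underlying float arrays explicitly, giving shape (2, 3, 1).
stacked = np.stack([df1.to_numpy(), df2.to_numpy()])

# Average across the two frames (axis 0), ignoring NaNs, then re-attach
# the labels. Assumes df1 and df2 have identical index and columns.
result = pd.DataFrame(np.nanmean(stacked, axis=0),
                      index=df1.index, columns=df1.columns)
print(result)
#      x
# 0  1.5
# 1  3.0
# 2  5.0
```

Using np.stack on the raw arrays makes the intended 3-d shape explicit, so nothing depends on how numpy decides to convert a list of DataFrames.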

These arrays of DataFrames have some especially weird behavior though. Say that we make a length-3 array where the third element is np.nan. You might expect to get the same answer from nanmean as we did with a before, since it should exclude the nan value, right?

print(np.nanmean(np.array([df1, df2, np.nan])))
#     x
# 0 NaN
# 1 NaN
# 2 NaN

Yea, so I'm not sure. Best to avoid making these.
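
If the goal is just an element-wise mean across frames that skips NaN, you can probably stay inside pandas and never build an array of DataFrames. A minimal sketch, assuming the frames are aligned on the same row labels (combined and filled are illustrative names):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'x': [1, 3, np.nan]})
df2 = pd.DataFrame({'x': [2, np.nan, 5]})

# Stack the frames vertically; each row label 0, 1, 2 now appears twice.
combined = pd.concat([df1, df2])

# Group rows that share a label and average them; mean() skips NaN by default.
filled = combined.groupby(level=0).mean()
print(filled)
#      x
# 0  1.5
# 1  3.0
# 2  5.0
```

This matches the output of the .values version above, and it works for more than two frames by adding them to the list passed to pd.concat.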

Upvotes: 1
