Reputation: 13
I'm trying to calculate the mean of all the values (all of them numeric, unlike in the 'How to calculate the mean of a pandas DataFrame with NaN values' question) of a pandas DataFrame that contains a lot of np.nan in it.
I've come up with this code, which works quite well by the way:
import numpy as np
import pandas as pd

my_df = pd.DataFrame([[0, 10, np.nan, 220],
                      [1, np.nan, 21, 221],
                      [2, 12, 22, np.nan],
                      [np.nan, 13, np.nan, np.nan]])
print(my_df.values.flatten()[~np.isnan(my_df.values.flatten())].mean())
However, I found that this line of code gives the same result, and I don't understand why:
print(my_df.values[~np.isnan(my_df.values)].mean())
Is this really the same, and can I use it safely?
I mean, my_df.values[~np.isnan(my_df.values)] is still an array, and it is not flat, so what happened to the np.nan in it?
Any improvement is welcome if you see a more efficient or more Pythonic way to do this. Thanks a lot.
Upvotes: 1
Views: 135
Reputation: 476729
Is this really the same, and can I use it safely?
Yes: indexing with the boolean mask makes numpy drop the NaNs (and flatten the array in the process), and the mean is then calculated over that 1-D array. But you are making it overcomplicated here.
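Indexing an array with a boolean mask always returns a flattened 1-D copy containing only the True positions, which is why the explicit flatten() changes nothing. A quick check with your frame:

```python
import numpy as np
import pandas as pd

my_df = pd.DataFrame([[0, 10, np.nan, 220],
                      [1, np.nan, 21, 221],
                      [2, 12, 22, np.nan],
                      [np.nan, 13, np.nan, np.nan]])

vals = my_df.values
masked = vals[~np.isnan(vals)]   # boolean indexing: flat 1-D copy, NaNs dropped

print(masked.ndim)    # 1 -- already flat, so flatten() is redundant
print(masked.mean())  # 52.2
```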
You can use numpy's nanmean(..)
[numpy-doc] here:
>>> np.nanmean(my_df)
52.2
The NaN values are thus not taken into account (neither in the sum nor in the count of the mean). I think this is more declarative than calculating the mean with masking, since the above says what you are doing, not so much how you are doing it.
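For completeness, if you prefer to stay within pandas, flattening the frame with stack() should give the same result, since Series.mean() skips NaN by default (skipna=True):

```python
import numpy as np
import pandas as pd

my_df = pd.DataFrame([[0, 10, np.nan, 220],
                      [1, np.nan, 21, 221],
                      [2, 12, 22, np.nan],
                      [np.nan, 13, np.nan, np.nan]])

# stack() flattens the frame into a Series; Series.mean() skips
# NaN by default, so the missing values drop out either way
print(my_df.stack().mean())  # 52.2
```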
In case you want to count the NaNs as zeros, you can replace them with 0, as @abdullah.cu says, like:
>>> my_df.fillna(0).values.mean()
32.625
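As a sanity check on that number: filling the NaNs with 0 leaves the total sum unchanged but divides by all 16 cells instead of only the 10 non-NaN ones, so it is equivalent to dividing the NaN-ignoring sum by the full size:

```python
import numpy as np
import pandas as pd

my_df = pd.DataFrame([[0, 10, np.nan, 220],
                      [1, np.nan, 21, 221],
                      [2, 12, 22, np.nan],
                      [np.nan, 13, np.nan, np.nan]])

vals = my_df.values
# NaN-ignoring sum over all 16 cells == mean after filling NaN with 0
print(np.nansum(vals) / vals.size)    # 32.625
print(my_df.fillna(0).values.mean())  # 32.625
```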
Upvotes: 2