Xin Niu

Reputation: 635

Built-in any() is not consistent when applied to columns versus the whole DataFrame in Python

I have a dataframe that might contain NaN values.

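import numpy as np
import pandas as pd

array = np.empty((4,5))
array[:] = 10
df = pd.DataFrame(array)
df.iloc[1,3] = np.nan  # np.NaN was removed in NumPy 2.0; np.nan works in all versions

df.isna().apply(lambda x: any(x), axis = 0)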

Output:

0    False
1    False
2    False
3     True
4    False
dtype: bool

When I run:

any(df.isna())

It returns:

True

If there are no NaNs:

array = np.empty((4,5))
array[:] = 10
df = pd.DataFrame(array)
#df.iloc[1,3] = np.nan

df.isna().apply(lambda x: any(x), axis = 0)

0    False
1    False
2    False
3    False
4    False
dtype: bool

However when I run:

any(df.isna())

It returns:

True

Why is this the case? Am I misunderstanding the function any()?

Upvotes: 1

Views: 72

Answers (1)

Rodalm

Reputation: 5433

Why is this the case? Am I misunderstanding the function any()?

When you iterate over a DataFrame, you are actually iterating over its column labels, not its rows or values as you might expect. More precisely, iteration (for example in a for loop) calls DataFrame.__iter__, which returns an iterator over the column labels of the DataFrame. For instance, in the following

import pandas as pd

df = pd.DataFrame(columns=['a', 'b', 'c'])
for x in df:
    print(x)

# Output:
#
# a
# b
# c

x holds the name of each df column. You can also confirm this by looking at the output of list(df).

This means that when you do any(df.isna()), under the hood any is actually iterating over the column labels of df and checking their truthiness. If at least one is truthy it returns True.

In both of your examples the column labels are integers: list(df.isna()) == list(df.columns) == [0, 1, 2, 3, 4], of which only 0 is falsy. Since the other labels are truthy, any(df.isna()) returns True in both cases, regardless of what the DataFrame contains.
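
To make this concrete, here is a minimal sketch that rebuilds the NaN-free DataFrame from the question and inspects what the built-in any actually sees. The names mask, labels and single are only illustrative, and the second DataFrame (a single column labelled 0) is a made-up case, not from the question, added to show that the result depends on the labels rather than on the data.

```python
import numpy as np
import pandas as pd

# Same shape and values as the NaN-free example in the question.
df = pd.DataFrame(np.full((4, 5), 10.0))
mask = df.isna()

# Iterating a DataFrame yields its column labels, so this is what any() receives.
labels = list(mask)
truthiness = [bool(label) for label in labels]
builtin_result = any(mask)

# Hypothetical counter-example: one column labelled 0 that does contain a NaN.
# The only label is 0 (falsy), so the built-in any misses the NaN.
single = pd.DataFrame({0: [np.nan, 1.0]})
single_result = any(single.isna())

print(labels)          # column labels, not values
print(truthiness)
print(builtin_result)
print(single_result)
```

So the built-in can be wrong in both directions: it reports True for a clean DataFrame and can report False for one that does contain NaN.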


Solution

The solution is to use DataFrame.any with axis=None, which reduces over both axes and returns a single boolean, instead of using the built-in any function.

df.isna().any(axis=None)
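
As a quick check, the sketch below applies that call to both DataFrames from the question. The helper has_nan and the names clean and dirty are only for illustration. It also shows DataFrame.any with its default axis, which should give the same per-column result as the apply version in the question.

```python
import numpy as np
import pandas as pd

clean = pd.DataFrame(np.full((4, 5), 10.0))
dirty = clean.copy()
dirty.iloc[1, 3] = np.nan

def has_nan(frame):
    # axis=None reduces over rows and columns to a single value.
    return bool(frame.isna().any(axis=None))

# Default axis=0 reduces down each column and returns a boolean Series.
per_column = dirty.isna().any()

print(has_nan(clean))
print(has_nan(dirty))
print(per_column.tolist())
```

If you prefer to work on the underlying array, frame.isna().to_numpy().any() should give the same answer.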

Upvotes: 1
