ALollz

Reputation: 59519

Why does Category dtype not handle != NaN comparisons correctly?

There seems to be inconsistent behavior in the != comparison depending upon whether the item belongs to the categories. If the value is in the categories, != NaN returns False, seemingly inconsistent with how a normal != NaN comparison would evaluate. When the value is not in the categories, the behavior is as expected.

import pandas as pd
import numpy as np

# Standard evaluation
'11' != np.NaN
#True

'A' != np.NaN
#True

s = pd.Series([np.NaN, '11']).astype('category')

s.ne('11')
#0    False   # <- What?
#1    False
#dtype: bool

s.ne('A')
#0    True
#1    True
#dtype: bool

# Without the category type the behavior is correct
pd.Series([np.NaN, '11']).ne('11')
#0     True
#1    False
#dtype: bool

Is this a bug, or is this for some reason the expected NaN behavior within categories? I am on pd.__version__ = 0.25.0, but the same behavior also appears on 1.0.

Upvotes: 0

Views: 204

Answers (1)

yatu

Reputation: 88226

The reason seems to be the way NaNs are treated when working with the category dtype. With Categorical data, values that are not included in the categories are replaced by NaN, i.e. a NaN is treated just like a non-existent category. We can check this by creating a Categorical as follows, explicitly specifying the existing categories:

c = pd.Categorical(values=['1','2',np.nan,'3','4'], categories=['1','2','3'])

print(c)
[1, 2, NaN, 3, NaN]
Categories (3, object): [1, 2, 3]
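As a small sanity check (a sketch of my own, not part of the original example), both the explicit NaN and the coerced '4' end up as missing values, while the categories themselves never contain NaN:

```python
import numpy as np
import pandas as pd

# '4' is not among the declared categories, so it is coerced to NaN
c = pd.Categorical(values=['1', '2', np.nan, '3', '4'],
                   categories=['1', '2', '3'])

print(list(c.categories))   # the categories never include NaN
print(c.isna().tolist())    # both the explicit NaN and the coerced '4' are missing
```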

And by checking the docs we see that:

Missing values should not be included in the Categorical's categories, only in the values. Instead, it is understood that NaN is different, and is always a possibility.

So missing values are considered to always be a possibility when compared to a value from an existing category, which is why the != comparison does not evaluate to True for them.

Using the above example, we can see the same behavior for the original missing value NaN and for '4', whose category does not exist and which was therefore coerced to NaN:

c != '3'
array([ True,  True, False, False, False])
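If the standard NaN semantics are needed for the comparison, one possible workaround (my own sketch, not taken from the answer) is to cast the series back to object dtype before comparing:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, '11']).astype('category')

# Casting back to object restores the usual NaN != value behavior
result = s.astype(object).ne('11')
print(result.tolist())   # NaN != '11' is True again, '11' != '11' is False
```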

Upvotes: 2
