Reputation: 59519
There seems to be inconsistent behavior in the != comparison depending on whether the item belongs to the categories. If the value is in the categories, != NaN returns False, seemingly inconsistent with how a normal != NaN comparison would evaluate. When the value is not in the categories, the behavior is as expected.
import pandas as pd
import numpy as np
# Standard evaluation
'11' != np.nan
#True
'A' != np.nan
#True
s = pd.Series([np.nan, '11']).astype('category')
s.ne('11')
#0 False # <- What?
#1 False
#dtype: bool
s.ne('A')
#0 True
#1 True
#dtype: bool
# Without the category type the behavior is correct
pd.Series([np.nan, '11']).ne('11')
#0 True
#1 False
#dtype: bool
Is this a bug, or is it for some reason the expected NaN behavior within categories? pd.__version__ is 0.25.0, but the behavior also appears on 1.0.
Upvotes: 0
Views: 204
Reputation: 88226
The reason seems to be the way NaNs are treated when working with category-type data. With Categorical data, values that are not included in the categories are replaced by NaN, i.e. a NaN is treated just like a non-existent category. We can check this by creating a Categorical as follows and specifying the existing categories:
c = pd.Categorical(values=['1','2',np.nan,'3','4'], categories=['1','2','3'])
print(c)
[1, 2, NaN, 3, NaN]
Categories (3, object): [1, 2, 3]
And by checking the docs we see that:
Missing values should not be included in the Categorical’s categories, only in the values. Instead, it is understood that NaN is different, and is always a possibility
So missing values are considered to always be a possibility when compared to a value from an existing category.
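To make this concrete, the codes attribute shows how this is stored internally: anything outside the categories, including an explicit NaN, gets the sentinel code -1, and those slots surface as missing values. (This is only an illustration of the internals; the exact repr may vary by pandas version.)

```python
import numpy as np
import pandas as pd

# Both the explicit NaN and the out-of-category '4' end up with code -1
c = pd.Categorical(values=['1', '2', np.nan, '3', '4'],
                   categories=['1', '2', '3'])

print(c.codes.tolist())    # [0, 1, -1, 2, -1]; -1 marks "missing"
print(pd.isna(c).tolist()) # [False, False, True, False, True]
```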
Using the above example, we can see the same behavior for a missing value (NaN) and for '4', a value from a non-existent category:
c != '3'
array([ True, True, False, False, False])
Upvotes: 2