Yingxue Pan

Reputation: 1

Pandas: isin() gives different results when the elements of the list it receives are in a different order

I am trying to select the columns whose data type is 'object', 'datetime64[ns]', or 'int64'. I get different results depending on the order of the data types in the list.
Specifically, when I run train.dtypes.isin(['object', 'datetime64[ns]', 'int64']), customer_ID is reported as being in the data type list,
but when I run train.dtypes.isin(['datetime64[ns]', 'int64', 'object']), customer_ID is reported as not being in the list.
train.dtypes shows that customer_ID has the 'object' type, so I expect these two lines to give the same result, but they don't. Why does the order matter? Does this have anything to do with how pandas compares data types and how isin() handles multiple comparisons?

I have tried using the explicit OR operator | instead of isin(): train.dtypes[(train.dtypes == 'datetime64[ns]') | (train.dtypes == 'int64') | (train.dtypes == 'object')]. This returns the result I expect, but I still wonder why isin() does not work as expected.
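A string-based variant of the same filter can be sketched as follows. It converts each dtype to its string name first, so that isin() compares plain strings with plain strings. The frame t and the helper dtype_mask are hypothetical names used only for illustration; the columns mirror the three dtypes discussed above. This is a minimal sketch under those assumptions, not a statement about what pandas intends.

```python
import pandas as pd

# Hypothetical frame with one column of each dtype mentioned in the question.
t = pd.DataFrame({
    'o': pd.Series(['a'], dtype=object),
    'i': pd.Series([1], dtype='int64'),
    'd': pd.Series(['2000-01-01'], dtype='datetime64[ns]'),
})

def dtype_mask(frame, wanted):
    """Hypothetical helper: True for columns whose dtype name is in `wanted`."""
    # astype(str) turns dtype('O') into 'object', dtype('<M8[ns]') into
    # 'datetime64[ns]', and so on, so the lookup is string against string.
    return frame.dtypes.astype(str).isin(wanted)

order_a = dtype_mask(t, ['object', 'datetime64[ns]', 'int64']).tolist()
order_b = dtype_mask(t, ['datetime64[ns]', 'int64', 'object']).tolist()
print(order_a, order_b)
```

With string names on both sides, the order of the list should no longer matter.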

Upvotes: 0

Views: 49

Answers (1)

Bogdan Shevchenko

Reputation: 439

I don't have an answer, but I can share some findings.

First, a reproducible example:

import pandas as pd

t = pd.DataFrame([
    ['a', 1, '01-01-2000']
], columns=['o', 'i', 'd'])
t['d'] = pd.to_datetime(t.d)

print(t.dtypes.isin(['object', 'datetime64[ns]', 'int64']).tolist())
print(t.dtypes.isin(['datetime64[ns]', 'int64', 'object']).tolist())
# [False, True, False]
# [True, True, False]

The dtypes of t are:

o            object
i             int64
d    datetime64[ns]
dtype: object 

Or, shown another way:

print(t.dtypes.tolist())
# dtype('O'), dtype('int64'), dtype('<M8[ns]')

Second, I dug a little into the pandas code, and the problem appears to be in this function: https://github.com/pandas-dev/pandas/blob/v2.2.2/pandas/core/algorithms.py#L545

Its results are really strange and hard to explain:

import numpy as np  # needed for np.array below
from pandas.core import algorithms
from pandas._libs import hashtable

values = np.array(['object', 'datetime64[ns]', 'int64', 'qwerty', '<M8[ns]'], dtype=object)
for ind in [[0, 1, 2], [2, 1, 0], [1, 2, 0], [1, 0, 2], 
            [0, 2], [2, 0], [2, 0, 2], [0, 2, 2], [0, 2, 3], 
            [0, 1, 2, 3], [0, 2, 3, 1], [0, 4, 2, 3], [0, 2, 3, 4],
            [0], [1], [2]]:
    value = values[ind]
    print(value, ":", hashtable.ismember(t.dtypes.values, value))
#  ['object' 'datetime64[ns]' 'int64'] : [False  True False]
#  ['int64' 'datetime64[ns]' 'object'] : [ True  True False]
#  ['datetime64[ns]' 'int64' 'object'] : [ True  True False]
#  ['datetime64[ns]' 'object' 'int64'] : [False  True False]
#  ['object' 'int64'] : [False False False]
#  ['int64' 'object'] : [False  True False]
#  ['int64' 'object' 'int64'] : [False  True False]
#  ['object' 'int64' 'int64'] : [False False False]
#  ['object' 'int64' 'qwerty'] : [False  True False]
#  ['object' 'datetime64[ns]' 'int64' 'qwerty'] : [False False False]
#  ['object' 'int64' 'qwerty' 'datetime64[ns]'] : [False False False]
#  ['object' '<M8[ns]' 'int64' 'qwerty'] : [False False False]
#  ['object' 'int64' 'qwerty' '<M8[ns]'] : [False False  True]
#  ['object'] : [False False False]
#  ['datetime64[ns]'] : [False False False]
#  ['int64'] : [False False False]

For 'normal' arrays it works as expected:

hashtable.ismember(
    np.array([1, 2, 3, 4], dtype=object), 
    np.array([3.0, 2.0, 1.0], dtype=object)
)
# True, True, True, False

But this is a dead end for me. I couldn't find any hint in the code of this function as to why the magic occurs: https://github.com/pandas-dev/pandas/blob/v2.2.2/pandas/_libs/hashtable_func_helper.pxi.in#L210

The problem is probably in how objects are hashed, but I cannot find the implementation of that algorithm.
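One way to probe the hashing suspicion is to compare equality and hashing for a dtype and its string name. A NumPy dtype compares equal to its string spelling, but it is hashed independently of that string, and a hash-based lookup generally needs both to agree. Whether this mismatch is the actual cause of the isin() results above is an assumption, not something confirmed in the pandas source. The variable names below are only illustrative.

```python
import numpy as np

obj_dtype = np.dtype('O')

# Equality: NumPy coerces the string to a dtype before comparing.
equal_to_string = (obj_dtype == 'object')

# Hashing: the dtype and the string are hashed independently, so there is
# no guarantee that the two hashes agree (in practice they do not).
hash_matches_string = hash(obj_dtype) == hash('object')

# Two dtype objects built from different spellings are equal and hash alike.
same_dtype_hash = hash(np.dtype('datetime64[ns]')) == hash(np.dtype('<M8[ns]'))

print(equal_to_string, hash_matches_string, same_dtype_hash)
```

If the hashes differ while the values compare equal, a hash table may only find a match when the two happen to land in the same bucket, which would make the outcome depend on the number and order of the elements inserted.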

Upvotes: 0
