Reputation: 1
I am trying to select columns whose data type is 'object', 'datetime64[ns]', or 'int64'. I get different results depending on the order of the data type list.
Specifically, when I run train.dtypes.isin(['object', 'datetime64[ns]', 'int64']), customer_ID is reported as being in the data type list,
but when I run train.dtypes.isin(['datetime64[ns]', 'int64', 'object']), customer_ID is reported as not being in the list.
customer_ID has the 'object' type when I run train.dtypes, so I expect these two lines to give the same result, but they don't. Why does the order matter? Does this have anything to do with how pandas compares data types and how isin() handles multiple comparisons?
I have tried using the explicit OR operator | instead of isin(): train.dtypes[(train.dtypes == 'datetime64[ns]') | (train.dtypes == 'int64') | (train.dtypes == 'object')]
This returns the result I expect, but I still wonder why isin() does not work as expected.
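For comparison, here is a minimal sketch of another order-independent check. It assumes that converting the dtype objects to plain strings before calling isin() is acceptable for the use case; the frame below is a small stand-in for train, and every column name except customer_ID is hypothetical.

```python
import numpy as np
import pandas as pd

# Stand-in for `train`; all columns except customer_ID are hypothetical.
train = pd.DataFrame({
    "customer_ID": pd.Series(["a"], dtype=object),
    "amount": pd.Series([1], dtype="int64"),
    "date": pd.Series(np.array(["2000-01-01"], dtype="datetime64[ns]")),
    "score": pd.Series([0.5], dtype="float64"),
})

wanted = ["object", "datetime64[ns]", "int64"]

# Turn each dtype object into its string name first, so isin()
# compares strings with strings instead of dtype objects with strings.
mask = train.dtypes.astype(str).isin(wanted)
mask_reversed = train.dtypes.astype(str).isin(wanted[::-1])

print(mask.tolist())
print(mask_reversed.tolist())
```

With string names on both sides, the order of the list should no longer matter, because no dtype-to-string comparison is involved.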
Upvotes: 0
Views: 49
Reputation: 439
I don't have an answer, but I can share some findings.
First, a reproducible example:
import pandas as pd
t = pd.DataFrame([
    ['a', 1, '01-01-2000']
], columns=['o', 'i', 'd'])
t['d'] = pd.to_datetime(t.d)
print(t.dtypes.isin(['object', 'datetime64[ns]', 'int64']).tolist())
print(t.dtypes.isin(['datetime64[ns]', 'int64', 'object']).tolist())
# [False, True, False]
# [True, True, False]
The dtypes of t are:
o            object
i             int64
d    datetime64[ns]
dtype: object
Or, shown another way:
print(t.dtypes.tolist())
# [dtype('O'), dtype('int64'), dtype('<M8[ns]')]
Second, I dug a little into the pandas code, and the problem appears to be in this function: https://github.com/pandas-dev/pandas/blob/v2.2.2/pandas/core/algorithms.py#L545
Its results are really strange and hard to explain:
import numpy as np
from pandas.core import algorithms
from pandas._libs import hashtable
# t is the DataFrame from the reproducible example above
values = np.array(['object', 'datetime64[ns]', 'int64', 'qwerty', '<M8[ns]'], dtype=object)
for ind in [[0, 1, 2], [2, 1, 0], [1, 2, 0], [1, 0, 2],
            [0, 2], [2, 0], [2, 0, 2], [0, 2, 2], [0, 2, 3],
            [0, 1, 2, 3], [0, 2, 3, 1], [0, 4, 2, 3], [0, 2, 3, 4],
            [0], [1], [2]]:
    value = values[ind]
    print(value, ":", hashtable.ismember(t.dtypes.values, value))
# ['object' 'datetime64[ns]' 'int64'] : [False True False]
# ['int64' 'datetime64[ns]' 'object'] : [ True True False]
# ['datetime64[ns]' 'int64' 'object'] : [ True True False]
# ['datetime64[ns]' 'object' 'int64'] : [False True False]
# ['object' 'int64'] : [False False False]
# ['int64' 'object'] : [False True False]
# ['int64' 'object' 'int64'] : [False True False]
# ['object' 'int64' 'int64'] : [False False False]
# ['object' 'int64' 'qwerty'] : [False True False]
# ['object' 'datetime64[ns]' 'int64' 'qwerty'] : [False False False]
# ['object' 'int64' 'qwerty' 'datetime64[ns]'] : [False False False]
# ['object' '<M8[ns]' 'int64' 'qwerty'] : [False False False]
# ['object' 'int64' 'qwerty' '<M8[ns]'] : [False False True]
# ['object'] : [False False False]
# ['datetime64[ns]'] : [False False False]
# ['int64'] : [False False False]
For 'normal' arrays it works as expected:
hashtable.ismember(
    np.array([1, 2, 3, 4], dtype=object),
    np.array([3.0, 2.0, 1.0], dtype=object)
)
# [ True  True  True False]
But this is a dead end for me. I couldn't find any hint in the code of this function as to why the magic occurs: https://github.com/pandas-dev/pandas/blob/v2.2.2/pandas/_libs/hashtable_func_helper.pxi.in#L210
The problem is probably in how objects are hashed, but I cannot find the implementation of that hashing.
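A small check that seems consistent with the hashing suspicion, using only NumPy and plain Python containers. It does not inspect the pandas hashtable itself, so treat it as a sketch of a plausible cause rather than a confirmed explanation: a dtype object compares equal to its string name, but the two do not share a hash, which is the precondition any hash-based lookup relies on.

```python
import numpy as np

obj_dtype = np.dtype("O")

# NumPy coerces the string when comparing, so equality holds.
equal_to_name = obj_dtype == "object"

# A list membership test only uses ==, so it finds the dtype.
in_list = obj_dtype in ["object"]

# A set looks up by hash first; the dtype and the string hash
# differently, so the lookup is expected to miss.
same_hash = hash(obj_dtype) == hash("object")
in_set = obj_dtype in {"object"}

print(equal_to_name, in_list, same_hash, in_set)
```

If the pandas hashtable compares keys with == only after landing on a slot chosen by the hash, a dtype could match a string only when the two happen to fall into the same slot. That would depend on the table contents and insertion order, and, since Python randomizes string hashes per process by default, possibly on the run as well. This is an assumption about the internals, not something verified against the source.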
Upvotes: 0