Reputation: 11114
Simple example:
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(
...     columns=['x', 'y', 'z'],
...     data=np.array([
...         ['a', 1, 'foo'],
...         ['b', 2, 'bar'],
...         ['c', 3, 'biz'],
...         ['d', 99, 'baz']]))
>>> df
   x   y    z
0  a   1  foo
1  b   2  bar
2  c   3  biz
3  d  99  baz
>>> df[df.z.isin(['foo', 'biz'])]
   x  y    z
0  a  1  foo
2  c  3  biz
That works as expected!
However, when I try the same thing with column y:
>>> df[df.y.isin([1,3])]
Empty DataFrame
Columns: [x, y, z]
Index: []
What just happened?
I would have expected the same two rows as in the .z.isin(...) example above.
Upvotes: 1
Views: 1851
Reputation: 402523
Let's look at the source of the problem: it's the call to np.array.
np.array([['a', 1, 'foo'],
          ['b', 2, 'bar'],
          ['c', 3, 'biz'],
          ['d', 99, 'baz']])
This actually coerces the integers to strings:
array([['a', '1', 'foo'],
       ['b', '2', 'bar'],
       ['c', '3', 'biz'],
       ['d', '99', 'baz']], dtype='<U3')
Notice that the second column is all strings: a NumPy array has a single dtype, so the integers are coerced to match the strings. On the other hand, if you initialise the array with an explicit dtype=object, the individual types are preserved:
data = np.array([['a', 1, 'foo'],
                 ['b', 2, 'bar'],
                 ['c', 3, 'biz'],
                 ['d', 99, 'baz']], dtype=object)
df = pd.DataFrame(columns=['x', 'y', 'z'], data=data)
df.y.isin([1,3])
0 True
1 False
2 True
3 False
Name: y, dtype: bool
Or, better still, pass a heterogeneous list of lists (without converting it to an array), so that pandas infers a dtype for each column:
df = pd.DataFrame(data=[['a', 1, 'foo'],
                        ['b', 2, 'bar'],
                        ['c', 3, 'biz'],
                        ['d', 99, 'baz']],
                  columns=list('xyz'))
df.y.isin([1,3])
0 True
1 False
2 True
3 False
Name: y, dtype: bool
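To diagnose this kind of mismatch, it can help to compare the element types and dtypes of the two constructions side by side. The following is only a minimal sketch using the same data as above; the names rows, df_coerced and df_inferred are illustrative and not part of the question.

```python
import numpy as np
import pandas as pd

rows = [['a', 1, 'foo'],
        ['b', 2, 'bar'],
        ['c', 3, 'biz'],
        ['d', 99, 'baz']]

# Built via np.array: every element is coerced to a string first.
df_coerced = pd.DataFrame(np.array(rows), columns=list('xyz'))
# Built from the list of lists: pandas infers a dtype per column.
df_inferred = pd.DataFrame(rows, columns=list('xyz'))

print(type(df_coerced.y.iloc[0]))  # a string type, not an int
print(df_inferred.y.dtype)         # an integer dtype

# Integers never match the string values, but the strings '1' and '3' do.
print(df_coerced.y.isin([1, 3]).any())
print(df_coerced.y.isin(['1', '3']).tolist())
print(df_inferred.y.isin([1, 3]).tolist())
```

Checking the element type this way is usually quicker than reading the printed frame, since '1' and 1 look identical in the output.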
Upvotes: 3
Reputation: 11691
If you look at df.y, its dtype is object, because the values are stored as strings. If you convert the column to int, you will get the behavior you expect.
In [8]: df.y
Out[8]:
0 1
1 2
2 3
3 99
Name: y, dtype: object
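A minimal sketch of that conversion, assuming the frame is built exactly as in the question and that every value in y is a valid integer string (the name matches is just illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    columns=['x', 'y', 'z'],
    data=np.array([['a', 1, 'foo'],
                   ['b', 2, 'bar'],
                   ['c', 3, 'biz'],
                   ['d', 99, 'baz']]))

# Convert the string column back to integers.
df['y'] = df['y'].astype(int)

# isin now compares ints with ints, so rows 0 and 2 are selected.
matches = df[df.y.isin([1, 3])]
print(matches)
```

pd.to_numeric(df['y']) is another option for the same conversion.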
Upvotes: 1