jwd

Reputation: 11114

Why does Pandas Series.isin work for strings but not numbers?

Simple example:

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(
...     columns=['x', 'y', 'z'],
...     data=np.array([
...         ['a', 1, 'foo'],
...         ['b', 2, 'bar'],
...         ['c', 3, 'biz'],
...         ['d', 99, 'baz']]))
>>> df
   x   y    z
0  a   1  foo
1  b   2  bar
2  c   3  biz
3  d  99  baz

>>> df[df.z.isin(['foo', 'biz'])]
   x  y    z
0  a  1  foo
2  c  3  biz

That works as expected!

However, when I filter on y instead:

>>> df[df.y.isin([1,3])]
Empty DataFrame
Columns: [x, y, z]
Index: []

What just happened?

I would have expected the same two rows to be output as in the above .z.isin(...) example.

Upvotes: 1

Views: 1851

Answers (2)

cs95

Reputation: 402523

The source of the problem is the call to np.array.

np.array([['a', 1, 'foo'],
          ['b', 2, 'bar'],
          ['c', 3, 'biz'],
          ['d', 99, 'baz']])

A NumPy array has a single dtype, so this call coerces the integers to strings:

array([['a', '1', 'foo'],
       ['b', '2', 'bar'],
       ['c', '3', 'biz'],
       ['d', '99', 'baz']], dtype='<U3')

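Notice that the second column now holds strings because of this type coercion, so isin([1, 3]) compares the strings '1' and '3' against integers and finds no match. If you instead initialise the array with an explicit dtype=object, the individual types are preserved: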

data = np.array([['a', 1, 'foo'],
                 ['b', 2, 'bar'],
                 ['c', 3, 'biz'],
                 ['d', 99, 'baz']], dtype=object)

df = pd.DataFrame(columns=['x', 'y', 'z'], data=data)
df.y.isin([1,3])

0     True
1    False
2     True
3    False
Name: y, dtype: bool

Or, better still, pass a heterogeneous list of lists (without converting it to an array):

df = pd.DataFrame(data=[['a', 1, 'foo'],
                        ['b', 2, 'bar'],
                        ['c', 3, 'biz'],
                        ['d', 99, 'baz']], 
                  columns=list('xyz'))
df.y.isin([1,3])

0     True
1    False
2     True
3    False
Name: y, dtype: bool
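To confirm the diagnosis on your own data, a quick check along these lines may help. It is only a sketch: the names rows, coerced, preserved, df_coerced and df_list are illustrative and do not come from the question.

```python
import numpy as np
import pandas as pd

rows = [['a', 1, 'foo'],
        ['b', 2, 'bar'],
        ['c', 3, 'biz'],
        ['d', 99, 'baz']]

# Without an explicit dtype, NumPy picks one common type: a unicode string type.
coerced = np.array(rows)
# With dtype=object, each element keeps its original Python type.
preserved = np.array(rows, dtype=object)

print(coerced.dtype.kind)      # 'U' means unicode string
print(type(preserved[0, 1]))   # the integer survives as a Python int

# On the coerced data, isin only matches when given strings.
df_coerced = pd.DataFrame(coerced, columns=list('xyz'))
mask_str = df_coerced['y'].isin(['1', '3'])
mask_int = df_coerced['y'].isin([1, 3])

# A plain list of lists lets pandas infer a dtype per column.
df_list = pd.DataFrame(rows, columns=list('xyz'))
print(df_list.dtypes)
```

If mask_str selects rows while mask_int selects none, the column holds strings rather than numbers, which is the situation in the question.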

Upvotes: 3

sedavidw

Reputation: 11691

If you convert df.y to an integer dtype, you will get the behavior you expect. As it stands, the column has dtype object, because its values are strings:

In [8]: df.y
Out[8]: 
0     1
1     2
2     3
3    99
Name: y, dtype: object
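A minimal sketch of that conversion, assuming every value in y parses as an integer (astype(int) raises a ValueError otherwise). The variable name result is illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild the frame from the question: np.array turns every value into a string.
df = pd.DataFrame(
    columns=['x', 'y', 'z'],
    data=np.array([
        ['a', 1, 'foo'],
        ['b', 2, 'bar'],
        ['c', 3, 'biz'],
        ['d', 99, 'baz']]))

# Convert the string column to integers, then filter as in the question.
df['y'] = df['y'].astype(int)
result = df[df['y'].isin([1, 3])]
print(result)
```

If some values might not be numeric, pd.to_numeric(df['y'], errors='coerce') is an alternative that turns unparseable entries into NaN instead of raising.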

Upvotes: 1
