Chris J Harris

Reputation: 1851

numpy isin issue for timestamps?

I'm having a weird problem with the np.isin function. If I create a short pd.DatetimeIndex, and a date which exists within that index:

import numpy as np
import pandas as pd
test_index = pd.date_range(start='2000-01-01', end='2000-01-15', freq='B')
test_date = test_index[0]

I can check that the test_date is in fact the first element of the index:

test_date == test_index[0]
True

But the np.isin function seems to be unable to recognize test_date within test_index:

np.isin(test_index, test_date)
array([False, False, False, False, False, False, False, False, False,
       False])

The same thing happens if I pass the underlying values instead:

np.isin(test_index.values, test_date)

This seems wrong and weird. The data type of both test_date and test_index[0] is given as pd.Timestamp and there's no visible difference between them. Any help gratefully received.

Upvotes: 4

Views: 798

Answers (1)

alkasm

Reputation: 23022

This isn't really a numpy issue, it's a pandas issue. pd.date_range creates a DatetimeIndex, a special type of index that stores its values differently from the objects you get back when you index into it. From the docs on DatetimeIndex:

Immutable ndarray of datetime64 data, represented internally as int64, and which can be boxed to Timestamp objects that are subclasses of datetime and carry metadata such as frequency information.

That is hard to parse. "Array of type1 data, represented as type2, that gives you type3 objects when you index."
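To make that sentence concrete, here is a minimal sketch that pulls out each of the three types from the same index. The variable names (stored_dtype, raw_ints, boxed) are mine, and the exact reprs may differ between pandas versions, so I have left the outputs out.

```python
import numpy as np
import pandas as pd

test_index = pd.date_range(start="2000-01-01", end="2000-01-15", freq="B")

# "type1": the index advertises datetime64 data
stored_dtype = test_index.dtype

# "type2": the same values reinterpreted as plain int64 counts
raw_ints = np.asarray(test_index).view("int64")

# "type3": indexing boxes a single value into a pandas Timestamp
boxed = test_index[0]

print(stored_dtype, raw_ints.dtype, type(boxed))
```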

I don't actually get the same type for the two on my end. With pandas 0.22.0 the index has a datetime64 dtype, while test_date is a pandas._libs.tslib.Timestamp, which is in line with the documentation.

>>> test_index.dtype 
dtype('<M8[ns]')

>>> type(test_date)
pandas._libs.tslib.Timestamp

As the docs state, this Timestamp has additional metadata, which does not convert well in numpy:

>>> np.array(test_date)
array(Timestamp('2000-01-03 00:00:00', freq='B'), dtype=object)

As you can see, I just got an object array, and that object is definitely not what is stored in the DatetimeIndex. The same conversion happens implicitly inside numpy. From the docs on np.isin() (in the Notes section):

If test_elements is a set (or other non-sequence collection) it will be converted to an object array with one element.

So the value gets wrapped in a one-element object array instead of a datetime64 array, which is why you won't find it in the test_index array.
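A quick way to check this yourself is to compare what numpy makes of the bare Timestamp with what it makes of an explicit datetime64 scalar. This is only a sketch: the names as_object, as_dt64 and match are mine, and I am using Timestamp.to_datetime64() for the explicit conversion.

```python
import numpy as np
import pandas as pd

test_index = pd.date_range(start="2000-01-01", end="2000-01-15", freq="B")
test_date = test_index[0]

# Handing numpy the bare Timestamp yields a 0-d array of dtype object
as_object = np.asarray(test_date)

# Converting explicitly yields a numpy datetime64 scalar instead
as_dt64 = test_date.to_datetime64()

# With a datetime64 scalar, np.isin compares like with like
match = np.isin(test_index, as_dt64)

print(as_object.dtype, type(as_dt64), match)
```

The point is only that the dtype of the second argument decides whether the comparison is done between datetime64 values or against a generic Python object.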

The best bet is to use the built-in methods on a DatetimeIndex to search it, but you could also explicitly cast so numpy knows what's going on. Here are some different ways you could do this:

>>> np.isin(test_index, np.datetime64(test_date))
array([ True, False, False, False, False, False, False, False, False,
       False])
>>> test_index == test_date
array([ True, False, False, False, False, False, False, False, False,
       False])
>>> test_index.isin([test_date])
array([ True, False, False, False, False, False, False, False, False,
       False])
>>> test_date in test_index # if you just need yes or no; Index.contains() was removed in later pandas releases
True
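If you want to check that the pandas-native lookups agree with each other, something along these lines should work as a self-contained script. It is a sketch rather than the one right way to do it: the `in` operator and get_loc() are my additions to the list, chosen because, as far as I know, they are available across both old and recent pandas versions.

```python
import pandas as pd

test_index = pd.date_range(start="2000-01-01", end="2000-01-15", freq="B")
test_date = test_index[0]

# Elementwise comparison and isin() both give a boolean mask over the index
mask_eq = test_index == test_date
mask_isin = test_index.isin([test_date])

# Membership test when a plain yes/no is enough
found = test_date in test_index

# Integer position of the date (raises KeyError if it is absent)
position = test_index.get_loc(test_date)

print(mask_eq.tolist() == mask_isin.tolist(), found, position)
```

All of these stay inside pandas, so the Timestamp never has to pass through numpy's object-array conversion.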

Upvotes: 6
