D. A.

Reputation: 3509

Conversion from numpy.datetime64 to pandas.tslib.Timestamp bug?

I have a Python module that loads data directly into a dict of numpy.ndarray for use in a pandas.DataFrame. However, I noticed an issue with 'NA' values. My file format represents NA values as -9223372036854775808 (boost::integer_traits<int64_t>::const_min). My non-NA values load as expected (with the correct values) into the pandas.DataFrame. I believe what is happening is that my module loads into a numpy.datetime64 ndarray, which is then converted to a list of pandas.tslib.Timestamp. This conversion doesn't seem to preserve the 'const_min' integer. Try the following:

>>> pandas.tslib.Timestamp(-9223372036854775808)
NaT
>>> pandas.tslib.Timestamp(numpy.datetime64(-9223372036854775808))
<Timestamp: 1969-12-31 15:58:10.448384>

Is this a Pandas bug? I think I can have my module avoid using a numpy.ndarray in this case and use something pandas doesn't trip on (perhaps pre-allocating the list of tslib.Timestamp objects itself).
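
In the meantime, here is a minimal sketch of that pre-allocation idea (NA_SENTINEL, raw, column, and frame are made-up names; it assumes pandas exposes NaT at the top level and that Timestamp(int) interprets its argument as nanoseconds):

import numpy
import pandas

NA_SENTINEL = -9223372036854775808  # int64 min, my file format's NA marker

# Hypothetical sample data: microsecond counts as loaded from disk
raw = numpy.array([NA_SENTINEL, 1326834000090451], dtype=numpy.int64)

# Build the column as Python objects up front, so pandas never has to
# convert a datetime64 ndarray; the * 1000 scales microseconds to the
# nanoseconds that Timestamp(int) expects.
column = [pandas.NaT if v == NA_SENTINEL else pandas.tslib.Timestamp(v * 1000)
          for v in raw]
frame = pandas.DataFrame({'ts': pandas.Series(column)})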

Here is another example of unexpected things happening:

>>> npa = numpy.ndarray(1, dtype=numpy.datetime64)
>>> npa[0] = -9223372036854775808
>>> pandas.Series(npa)
0   NaT
>>> pandas.Series(npa)[0]
<Timestamp: 1969-12-31 15:58:10.448384>

Following Jeff's comment below, I have more information about what is going wrong.

>>> npa = numpy.ndarray(2, dtype=numpy.int64)
>>> npa[0] = -9223372036854775808
>>> npa[1] = 1326834000090451
>>> npa
array([-9223372036854775808,     1326834000090451])
>>> s_npa = pandas.Series(npa, dtype='M8[us]')
>>> s_npa
0                          NaT
1   2012-01-17 21:00:00.090451

Yay! The series preserved the NA and my timestamp. However, if I attempt to create a DataFrame from that series, the NaT disappears.

>>> pandas.DataFrame({'ts':s_npa})
                      ts
0 1969-12-31 15:58:10.448384
1 2012-01-17 21:00:00.090451

Ho-hum. On a whim, I tried interpreting my integers as nanoseconds past the epoch instead. To my surprise, the DataFrame worked properly:

>>> s2_npa = pandas.Series(npa, dtype='M8[ns]')
>>> s2_npa
0                             NaT
1   1970-01-16 08:33:54.000090451
>>> pandas.DataFrame({"ts":s2_npa})
                             ts
0                           NaT
1 1970-01-16 08:33:54.000090451

Of course, my timestamp is not right. My point is that pandas.DataFrame is behaving inconsistently here. Why does it preserve the NaT when using dtype='M8[ns]', but not when using 'M8[us]'?

I am currently using this workaround to convert my microsecond integers to nanoseconds, which slows things down quite a bit, but works:

>>> s = pandas.Series([1000*ts if ts != -9223372036854775808 else ts for ts in npa], dtype='M8[ns]')
>>> pandas.DataFrame({'ts':s})
                          ts
0                        NaT
1 2012-01-17 21:00:00.090451
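
If the element-by-element comprehension ever dominates load time, a vectorized sketch along these lines should produce the same NaT-preserving series (us_to_ns_series is a hypothetical helper; it assumes the sentinel is int64 min and the inputs are microseconds):

import numpy
import pandas

NA_SENTINEL = numpy.iinfo(numpy.int64).min  # -9223372036854775808

def us_to_ns_series(ints):
    # Scale everything except the NA sentinel from microseconds to
    # nanoseconds in a single vectorized pass.
    scaled = ints.copy()
    mask = scaled != NA_SENTINEL
    scaled[mask] *= 1000
    return pandas.Series(scaled, dtype='M8[ns]')

Calling us_to_ns_series(npa) on the two-element npa above gives the same result as the comprehension.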

(Several hours later...)

Okay, I've made some progress. Delving into the code, I found that the repr function on Series eventually calls '_format_datetime64', which checks 'isnull' and prints 'NaT'. That explains the difference between these two:

>>> pandas.Series(npa)
0   NaT
>>> pandas.Series(npa)[0]
<Timestamp: 1969-12-31 15:58:10.448384>

The former seems to honor the NA, but only when printing. I suppose there may be other pandas functions that call 'isnull' and act on the answer, which would make NA timestamps appear to partially work in this case. However, I know the Series is incorrect because of the type of element zero: it is a Timestamp, but it should be a NaTType. My next step is to dive into the Series constructor to figure out when/how pandas uses the NaT value during construction. Presumably it is missing a case when I specify dtype='M8[us]'... (more to come).
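
Here is a tiny probe (illustrative only, reusing the one-element npa from the earlier example) that makes the mismatch visible:

import pandas

s = pandas.Series(npa)   # npa: the one-element datetime64 array from above
pandas.isnull(s)         # the array-level check that the repr path relies on
type(s[0])               # a Timestamp, where a true NA would be a NaTType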

Following Andy's suggestion in the comments, I tried using a pandas Timestamp to resolve the issue. It didn't work. Here is an example of those results:

>>> npa = numpy.ndarray(1, dtype='i8')
>>> npa[0] = -9223372036854775808
>>> npa
array([-9223372036854775808])
>>> pandas.tslib.Timestamp(npa.view('M8[ns]')[0]).value
-9223372036854775808
>>> pandas.tslib.Timestamp(npa.view('M8[us]')[0]).value
-28909551616000

Upvotes: 3

Views: 4617

Answers (1)

D. A.

Reputation: 3509

Answer: No

Technically speaking, that is. I posted the bug on GitHub and got a response here: https://github.com/pydata/pandas/issues/2800#issuecomment-13161074

"Units other than nanoseconds are not supported right now in indexing etc. This should be strictly enforced"

All of the tests I've run with 'ns' rather than 'us' work fine. I'm looking forward to a future release.
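
For the record, the kind of check I mean looks roughly like this (sample values invented):

import numpy
import pandas

NA_SENTINEL = numpy.iinfo(numpy.int64).min

# Nanosecond counts, i.e. what my loader now produces after scaling
ints = numpy.array([NA_SENTINEL, 1326834000090451 * 1000], dtype=numpy.int64)

frame = pandas.DataFrame({'ts': pandas.Series(ints, dtype='M8[ns]')})
assert pandas.isnull(frame['ts'][0])  # the NaT survives the DataFrame round trip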

For anyone interested, I modified my C++ Python module to iterate over the int64_t arrays loaded from disk and multiply everything by 1000, except for NA values (boost::integer_traits<int64_t>::const_min). I was worried about performance, but the difference in load time is tiny for me. (Doing the same in Python is very, very slow.)

Upvotes: 2
