Heikki

Reputation: 319

Pandas datetime64 with longer range

I have a DataFrame with datetime values spanning from year 1 to far into the future. When I import the data into pandas, the dtype is set to object, although I would like it to be datetime64 so that I can use the .dt accessor.

Consider this piece of code:

import pytz
from datetime import datetime
import pandas as pd

df = pd.DataFrame({'dates': [datetime(108, 7, 30, 9, 25, 27, tzinfo=pytz.utc),
                             datetime(2018, 3, 20, 9, 25, 27, tzinfo=pytz.utc),
                             datetime(2529, 7, 30, 9, 25, 27, tzinfo=pytz.utc)]})

In [5]: df.dates
Out[5]: 
0    0108-07-30 09:25:27+00:00
1    2018-03-20 09:25:27+00:00
2    2529-07-30 09:25:27+00:00
Name: dates, dtype: object

How can I convert it to dtype datetime64[s]? I don't really care about nano/millisecond accuracy, but I would like the range.

Upvotes: 0

Views: 1160

Answers (1)

abarnert

Reputation: 365875

Pandas can generally convert to and from datetime.datetime objects:

df.dates = pd.to_datetime(df.dates)

But in your case, you can't do this, for two reasons.

First, you've attached a timezone to your datetimes, and tz-aware values take a separate path in Pandas (the datetime64[ns, tz] dtype rather than plain datetime64[ns]). Fortunately, this one is easy to sidestep: you're explicitly using UTC, and you can do that without aware objects.
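Dropping the timezone from values that are already in UTC can be sketched as follows. This is a minimal illustration using only the standard library; the helper name to_naive_utc is hypothetical, and datetime.timezone.utc stands in for pytz.utc from the question.

```python
from datetime import datetime, timezone


def to_naive_utc(dt):
    """Return dt as a naive datetime whose fields are expressed in UTC."""
    if dt.tzinfo is None:
        # Assumption: naive inputs are already meant as UTC.
        return dt
    # Normalise to UTC first, then drop the tzinfo marker.
    return dt.astimezone(timezone.utc).replace(tzinfo=None)


aware = datetime(108, 7, 30, 9, 25, 27, tzinfo=timezone.utc)
naive = to_naive_utc(aware)
print(naive, naive.tzinfo)
```

This only removes the timezone; it does nothing about the range limit discussed next.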

Second, 64-bit nanoseconds can't handle a date range as wide as you want:

>>> (1<<64) // 1000000000 / 3600 / 24 / 365.2425
584.5540492538555
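The same arithmetic can be repeated for coarser units to see how much range a 64-bit counter would cover at each resolution. This is only a back-of-the-envelope sketch; the names span_in_years and TICKS_PER_SECOND are made up for the illustration.

```python
# Average Gregorian year length in seconds, as in the calculation above.
SECONDS_PER_YEAR = 3600 * 24 * 365.2425

# How many ticks of each unit fit in one second.
TICKS_PER_SECOND = {"ns": 10**9, "us": 10**6, "ms": 10**3, "s": 1}


def span_in_years(unit):
    """Total span, in years, of a 64-bit counter ticking in the given unit."""
    return (1 << 64) / TICKS_PER_SECOND[unit] / SECONDS_PER_YEAR


for unit in TICKS_PER_SECOND:
    print(f"{unit:>2}: about {span_in_years(unit):,.0f} years")
```

At nanosecond resolution the span is the roughly 584 years shown above; at second resolution it is hundreds of billions of years, far more than the range in the question needs.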

And the Pandas documentation makes this clear:

Since pandas represents timestamps in nanosecond resolution, the time span that can be represented using a 64-bit integer is limited to approximately 584 years:

In [66]: pd.Timestamp.min
Out[66]: Timestamp('1677-09-21 00:12:43.145225')

In [67]: pd.Timestamp.max
Out[67]: Timestamp('2262-04-11 23:47:16.854775807')

(It looks like they put the 0 point at the Unix epoch, which makes sense.)
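For comparison, NumPy's own datetime64 type accepts coarser units, and at second resolution the three dates from the question fit comfortably. The sketch below uses NumPy only. Whether a pandas Series keeps this dtype or tries to convert it to nanoseconds depends on the pandas version, so that step is deliberately left out.

```python
import numpy as np

# The question's three timestamps as naive ISO strings (assumed to be UTC).
dates = np.array(
    ["0108-07-30T09:25:27", "2018-03-20T09:25:27", "2529-07-30T09:25:27"],
    dtype="datetime64[s]",
)

# Truncate to year resolution; the integer value counts years since 1970.
years = dates.astype("datetime64[Y]").astype("int64") + 1970

print(dates.dtype)
print(years.tolist())
```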

But notice that the documentation links to Representing Out-of-Bounds Spans: you can use Periods, which will be less convenient than native datetime64 values, but probably better than objects. (As far as I know, the values are stored internally as integer ordinals counted in units of the period's frequency, directly in the array, instead of as references to Python objects on the heap.)
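A minimal sketch of the Period approach for the three dates in the question, assuming second frequency and naive UTC values. The tuple layout and variable names are just for illustration, and the exact dtype string printed may differ between pandas versions.

```python
import pandas as pd

# (year, month, day, hour, minute, second) for each date in the question.
fields = [
    (108, 7, 30, 9, 25, 27),
    (2018, 3, 20, 9, 25, 27),
    (2529, 7, 30, 9, 25, 27),
]

# Build each Period from its fields so no date-string parsing is involved.
periods = [
    pd.Period(year=y, month=mo, day=d, hour=h, minute=mi, second=s, freq="s")
    for (y, mo, d, h, mi, s) in fields
]

# Wrapping the PeriodIndex in a Series gives a period dtype, not object.
dates = pd.Series(pd.PeriodIndex(periods), name="dates")

print(dates.dtype)
print(dates.dt.year.tolist())
```

With a period dtype the .dt accessor is available, so fields such as dates.dt.year and dates.dt.month can be read without falling back to Python objects.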

Upvotes: 1
