max

Reputation: 52343

Storing pure python datetime.datetime in pandas DataFrame

Since matplotlib doesn't support either pandas.Timestamp or numpy.datetime64, and there are no simple workarounds, I decided to convert a native pandas date column into pure Python datetime.datetime objects so that scatter plots are easier to make.

However:

import pandas as pd
t = pd.DataFrame({'date': [pd.to_datetime('2012-12-31')]})
t.dtypes # date    datetime64[ns], as expected
pure_python_datetime_array = t.date.dt.to_pydatetime() # works fine
t['date'] = pure_python_datetime_array # doesn't do what I hoped
t.dtypes # date    datetime64[ns] as before, no luck changing it

I'm guessing pandas auto-converts the pure python datetime produced by to_pydatetime into its native format. I guess it's convenient behavior in general, but is there a way to override it?

Upvotes: 6

Views: 11386

Answers (3)

PiMathCLanguage

Reputation: 375

Here is a possible solution with the Series class from pandas:

import pandas as pd
t = pd.DataFrame({'date': [pd.to_datetime('2012-12-31')]})
t.dtypes # date    datetime64[ns], as expected
pure_python_datetime_array = t.date.dt.to_pydatetime() # works fine
t['date'] = pd.Series(pure_python_datetime_array, dtype=object) # should do what you expect
t.dtypes # date    object; the column now holds datetime.datetime values
type(t.values[0, 0]) # datetime.datetime, so you can access the datetime object directly

Why does this work? My assumption is that forcing the dtype of the date column to object stops pandas from doing any internal conversion from datetime.datetime to datetime64.

Please correct me if I am wrong.
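To make the difference visible, here is a minimal sketch comparing the inferred dtype with the forced object dtype. The names stamps, inferred and forced are only illustrative and not part of the question; the exact datetime64 resolution pandas infers may vary by version.

```python
import datetime
import pandas as pd

# Plain Python datetimes, as produced by to_pydatetime() in the question
stamps = [datetime.datetime(2012, 12, 31), datetime.datetime(2013, 12, 31)]

inferred = pd.Series(stamps)              # pandas infers a datetime64 dtype
forced = pd.Series(stamps, dtype=object)  # explicit object dtype, no inference

t = pd.DataFrame({'date': inferred})
t['date'] = forced                        # assigning an object Series keeps object dtype

print(inferred.dtype)           # a datetime64 dtype
print(t['date'].dtype)          # object
print(type(t['date'].iloc[0]))  # <class 'datetime.datetime'>
```

The trade-off is that an object column loses the vectorized .dt accessor, so this is probably only worth doing right before handing the values to code that needs plain datetime objects.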

Upvotes: 2

szeitlin

Reputation: 3351

For me, the steps look like this:

  1. convert timezone with pytz
  2. convert to_datetime with pandas and make that the index
  3. plot and autoformat

Starting df looks like this:

[image: before converting timestamps]

  1. Convert the timezone:

    import pytz

    ts['posTime'] = [x.astimezone(pytz.timezone('US/Pacific')) for x in ts['posTime']]

I can see that it worked because the timestamps changed format:

[image: after timezone conversion]

  2. sample['posTime'] = pandas.to_datetime(sample['posTime'])

    sample.index = sample['posTime']

At this point, just plotting with pandas (which uses matplotlib under the hood) gives me a nice rotation and totally the wrong format:

[image: after pandas datetime conversion]

  3. However, there's nothing wrong with the format of the objects. I can now make a scatterplot with matplotlib and it autoformats the datetimes as you'd expect.

    plt.scatter(sample['posTime'].values, sample['Altitude'].values)

    fig = plt.gcf()

    fig.set_size_inches(9.5, 3.5)

[image: formatted]

    If you use the auto format method, you can zoom in and it will continue to automatically choose the appropriate format (but you still have to set the scale manually).

[image: autoformatted]
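The plotting and autoformat steps can be sketched roughly as below. This assumes a reasonably recent matplotlib, which accepts datetime64 values directly. The sample data is made up, and the use of AutoDateLocator, AutoDateFormatter and fig.autofmt_xdate() is my guess at what "the auto format method" refers to, not something confirmed by the answer.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import pandas as pd

# Made-up data standing in for the posTime / Altitude columns
sample = pd.DataFrame({
    "posTime": pd.to_datetime(
        ["2015-06-01 08:00", "2015-06-01 09:30", "2015-06-01 11:00"]
    ),
    "Altitude": [1200, 3400, 2100],
})

fig, ax = plt.subplots(figsize=(9.5, 3.5))
points = ax.scatter(sample["posTime"].values, sample["Altitude"].values)

# Let matplotlib pick tick positions and labels that suit the visible range
locator = mdates.AutoDateLocator()
ax.xaxis.set_major_locator(locator)
ax.xaxis.set_major_formatter(mdates.AutoDateFormatter(locator))
fig.autofmt_xdate()  # rotate and right-align the date labels

fig.savefig("altitude_scatter.png")
```

Because the locator and formatter are the "auto" variants, the tick labels should adapt when the x-range changes, which matches the zooming behavior described above.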

Upvotes: 0

Nehal J Wani

Reputation: 16639

The use of to_pydatetime() is correct.

In [87]: t = pd.DataFrame({'date': [pd.to_datetime('2012-12-31'), pd.to_datetime('2013-12-31')]})

In [88]: t.date.dt.to_pydatetime()
Out[88]: 
array([datetime.datetime(2012, 12, 31, 0, 0),
       datetime.datetime(2013, 12, 31, 0, 0)], dtype=object)

When you assign it back to t.date, pandas automatically converts it back to datetime64.

pandas.Timestamp is a datetime subclass anyway :)
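A quick way to check the subclass relationship yourself (the variable names ts and plain are only for illustration):

```python
import datetime
import pandas as pd

ts = pd.Timestamp("2012-12-31")

# Timestamp inherits from datetime.datetime, so isinstance checks pass
print(issubclass(pd.Timestamp, datetime.datetime))  # True
print(isinstance(ts, datetime.datetime))            # True

# to_pydatetime() returns an exact datetime.datetime instance
plain = ts.to_pydatetime()
print(type(plain) is datetime.datetime)             # True
```

So code that only needs "something that behaves like a datetime" can usually take a Timestamp as-is; the conversion matters mainly for code that checks the exact type.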

One way to do the plot is to convert the datetime to int64:

In [117]: t = pd.DataFrame({'date': [pd.to_datetime('2012-12-31'), pd.to_datetime('2013-12-31')], 'sample_data': [1, 2]})

In [118]: t['date_int'] = t.date.astype(np.int64)

In [119]: t
Out[119]: 
        date  sample_data             date_int
0 2012-12-31            1  1356912000000000000
1 2013-12-31            2  1388448000000000000

In [120]: t.plot(kind='scatter', x='date_int', y='sample_data')
Out[120]: <matplotlib.axes._subplots.AxesSubplot at 0x7f3c852662d0>

In [121]: plt.show()
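For reference, the integers in date_int are nanoseconds since the Unix epoch, which is why the axis labels in this workaround are not human-readable. A small sketch of the round trip (nanos and round_trip are illustrative names):

```python
import pandas as pd

ts = pd.Timestamp("2012-12-31")

# .value is the integer number of nanoseconds since 1970-01-01 00:00:00 UTC
nanos = ts.value
print(nanos)  # 1356912000000000000

# Converting the integer back recovers the original timestamp
round_trip = pd.to_datetime(nanos, unit="ns")
print(round_trip == ts)  # True
```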


Another workaround is to skip scatter and use a regular plot with a point style instead:

In [126]: t.plot(x='date', y='sample_data', style='.')
Out[126]: <matplotlib.axes._subplots.AxesSubplot at 0x7f3c850f5750>

And the last workaround:

In [141]: import matplotlib.pyplot as plt

In [142]: t = pd.DataFrame({'date': [pd.to_datetime('2012-12-31'), pd.to_datetime('2013-12-31')], 'sample_data': [100, 20000]})

In [143]: t
Out[143]: 
        date  sample_data
0 2012-12-31          100
1 2013-12-31        20000

In [144]: plt.scatter(t.date.dt.to_pydatetime(), t.sample_data)
Out[144]: <matplotlib.collections.PathCollection at 0x7f3c84a10510>

In [145]: plt.show()


There is an issue about this on GitHub, which is open as of now.

Upvotes: 4
