Reputation: 107
I have a database I'm reading from excel as a pandas dataframe, and the dates come in Timestamp dtype
, but I need them to be in np.datetime64
, so that I can make calculations.
I am aware that the function pd.to_datetime()
and the astype(np.datetime64[ns])
method do work. However, I am unable to update my dataframe to yield this datatype, for whatever reason, using the code mentioned above.
I have also tried creating an acessory dataframe from the original one, with just the dates that I wish to update the typing, converting it to np.datetime64
and plugging it back onto the original dataframe:
dfi = df['dates']
dfi = pd.to_datetime(dfi)
df['dates'] = dfi
But still it doesn't work. I have also tried updating values one by one:
arr_i = df.index
for i in range(len(arr_i)):
df.at[arri[l],'dates'].to_datetime64()
Edit
The root problem seems to be that the dtype
of the column gets updated to np.datetime64
, but somehow, when getting single values from within, they still have the dtype = Timestamp
Does anyone have a suggestion of a workaround that is fairly fast?
Upvotes: 1
Views: 2567
Reputation: 880787
Pandas tries to standardize all forms of datetimes by storing them as NumPy datetime64[ns] values when you assign them to a DataFrame. But when you try to access individual datetime64 values, they are returned as Timestamps.
There is a way to prevent this automatic conversion from happening however: Wrap the list of values in a Series of dtype object
:
import numpy as np
import pandas as pd
# create some dates, merely for example
dates = pd.date_range('2000-1-1', periods=10)
# convert the dates to a *list* of datetime64s
arr = list(dates.to_numpy())
# wrap the values you wish to protect in a Series of dtype object.
ser = pd.Series(arr, dtype='object')
# assignment with `df['datetime64s'] = ser` would also work
df = pd.DataFrame({'timestamps': dates,
'datetime64s': ser})
df.info()
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 10 entries, 0 to 9
# Data columns (total 2 columns):
# timestamps 10 non-null datetime64[ns]
# datetime64s 10 non-null object
# dtypes: datetime64[ns](1), object(1)
# memory usage: 240.0+ bytes
print(type(df['timestamps'][0]))
# <class 'pandas._libs.tslibs.timestamps.Timestamp'>
print(type(df['datetime64s'][0]))
# <class 'numpy.datetime64'>
But beware! Although with a little work you can circumvent Pandas' automatic conversion mechanism,
it may not be wise to do this. First, converting a NumPy array to a list is usually a sign you are doing something wrong, since it is bad for performance. Using object
arrays is a bad sign since operations on object arrays are generally much much slower than equivalent operations on arrays of native NumPy dtypes.
You may be looking at an XY problem -- it may be more fruitful to find a way to (1) work with Pandas Timestamps instead of trying to force Pandas to return NumPy datetime64s or (2) work with datetime64 array-likes (e.g. Series of NumPy arrays) instead of handling values individually (which causes the coersion to Timestamps).
Upvotes: 1