Reputation: 3218
I am trying to export a dataframe to a Parquet file that will be consumed later in the pipeline by something that is not Python or Pandas (Azure Data Factory).
When I ingest the Parquet file later in the flow, it cannot recognize datetime64[ns]. I would rather just use "vanilla" Python datetime.datetime.
But I cannot manage to do this. The problem is that Pandas forces any datetime-like object into datetime64[ns] once it is back in a dataframe or series.
For instance, assume the iris dataset with a "timestamp" column:
>>> df.head()
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)   class                  timestamp
0                5.1               3.5                1.4               0.2  setosa 2021-02-19 15:07:24.719272
1                4.9               3.0                1.4               0.2  setosa 2021-02-19 15:07:24.719272
2                4.7               3.2                1.3               0.2  setosa 2021-02-19 15:07:24.719272
3                4.6               3.1                1.5               0.2  setosa 2021-02-19 15:07:24.719272
4                5.0               3.6                1.4               0.2  setosa 2021-02-19 15:07:24.719272
>>> df.dtypes
sepal length (cm)           float64
sepal width (cm)            float64
petal length (cm)           float64
petal width (cm)            float64
class                      category
timestamp            datetime64[ns]
dtype: object
I can convert a value to a "normal Python datetime":
>>> df.timestamp[1]
Timestamp('2021-02-19 15:07:24.719272')
>>> type(df.timestamp[1])
<class 'pandas._libs.tslibs.timestamps.Timestamp'>
>>> df.timestamp[1].to_pydatetime()
datetime.datetime(2021, 2, 19, 15, 7, 24, 719272)
>>> type(df.timestamp[1].to_pydatetime())
<class 'datetime.datetime'>
But I cannot "keep" it in that type, when I convert the entire column / series:
>>> df['ts2'] = df.timestamp.apply(lambda x: x.to_pydatetime())
>>> df.dtypes
sepal length (cm)           float64
sepal width (cm)            float64
petal length (cm)           float64
petal width (cm)            float64
class                      category
timestamp            datetime64[ns]
ts2                  datetime64[ns]
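The behaviour described above can be reproduced without the iris data. The following is only a minimal sketch: the one-column frame is a hypothetical stand-in for the real dataframe, and the exact datetime64 unit that pandas reports may differ between pandas versions.

```python
import pandas as pd

# Hypothetical one-column frame standing in for the iris data.
df = pd.DataFrame(
    {"timestamp": pd.to_datetime(["2021-02-19 15:07:24.719272"] * 3)}
)

# Each element is converted to datetime.datetime inside the lambda...
df["ts2"] = df["timestamp"].apply(lambda x: x.to_pydatetime())

# ...but pandas infers a datetime64 dtype for the resulting Series again.
print(df.dtypes)
print(pd.api.types.is_datetime64_any_dtype(df["ts2"]))
```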
I looked to see if there was anything I could do to "dumb down" the dataframe column and make its datetimes less precise, but I could not find anything. Nor can I see an option to specify column data types upon export via the df.to_parquet() method.
Is there a way to create a plain Python datetime.datetime column (not the Numpy/Pandas datetime64[ns] column) in a Pandas dataframe?
Upvotes: 3
Views: 1741
Reputation: 71
In my case, when I tried to convert datetime64[ns] to datetime, I used the dt.date accessor. The result is an object column holding datetime.date values (so the time of day is dropped) rather than a true datetime column, but it worked:
df['added_column_name'] = pd.to_datetime(df['column_name']).dt.date
df.head()
Now, 'added_column_name' has the object dtype.
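As a rough illustration of what this approach yields, here is a small sketch. The frame and the column names `column_name` and `added_column_name` are placeholders, not names from the question.

```python
import datetime
import pandas as pd

# Placeholder frame; 'column_name' stands in for the real timestamp column.
df = pd.DataFrame({"column_name": ["2021-02-19 15:07:24.719272"] * 3})

# .dt.date returns an object-dtype Series of datetime.date values.
df["added_column_name"] = pd.to_datetime(df["column_name"]).dt.date

first = df.loc[0, "added_column_name"]
print(df["added_column_name"].dtype)
print(type(first))
```

Note that the values are `datetime.date`, not `datetime.datetime`, so this only fits if the time component is not needed downstream.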
Upvotes: 1
Reputation: 150785
Try to force dtype='object' when you use to_pydatetime:
df['ts'] = pd.Series(df.timestamp.dt.to_pydatetime(),dtype='object')
df.loc[0,'ts']
Output:
datetime.datetime(2021, 2, 19, 15, 7, 24, 719272)
Upvotes: 3
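For anyone who wants to try the object-dtype idea end to end, here is a self-contained sketch. It builds the values with a list comprehension instead of `Series.dt.to_pydatetime()`, which is a substitution on my part to keep the example independent of how that accessor behaves in a given pandas version. The one-column frame is a hypothetical stand-in for the iris data.

```python
import datetime
import pandas as pd

# Hypothetical stand-in for the dataframe in the question.
df = pd.DataFrame(
    {"timestamp": pd.to_datetime(["2021-02-19 15:07:24.719272"] * 3)}
)

# Convert each pandas Timestamp to a plain datetime.datetime.
py_values = [ts.to_pydatetime() for ts in df["timestamp"]]

# dtype="object" stops pandas from inferring a datetime64 dtype again.
df["ts"] = pd.Series(py_values, index=df.index, dtype="object")

print(df.dtypes)
print(type(df.loc[0, "ts"]))
```

Whether a Parquet writer then stores the object column in a form the downstream consumer accepts is a separate question that this sketch does not cover.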