Reputation: 61
I have a parquet file with a date field in it called 'BusinessDate'
. When I import it to a dataframe, it automatically determines the field BusinessDate being a date (datetime64[ns, UTC]
).
However, since the format of this 'BusinessDate'
field is 'YYYY-MM-DD'
, some of these dates were incorrectly imported. E.g. 2013-02-01 should be the first of February 2013 while it actually was interpreted as the second of January 2013.
Is it possible to set the correct format of the 'BusinessDate'
field when importing the parquet file?
Initially I used:
df.read_parquet('data.parquet')
If I had a csv file my solution would be:
custom_date_parser = lambda x: datetime.strptime(x, '%Y-%m-%d')
df.read_csv('data.csv',parse_dates=['BusinessDate'], date_parser=custom_date_parser)
However, when I try a comparable code to try to fix the date issue, it gives an error:
custom_date_parser = lambda x: datetime.strptime(x, '%Y-%m-%d')
df.read_parquet('data.parquet',parse_dates=['BusinessDate'], date_parser=custom_date_parser)
The error is caused due to the fact that the read_parquet
function doesn't have a parse_dates
or date_parser
attribute, read_csv
function does have.
So my question is: How do I import a parquet file in pandas so that the 'BusinessDate'
field is imported correctly as a date in the correct format which is 'YYYY-MM-DD'
in my case. Or in case this not possible with pandas read_parquet
function, is it possible to import the 'BusinessDate'
field as a string field in a pandas.DataFrame
so that I can change it afterwards?
Upvotes: 5
Views: 5503
Reputation: 62493
python 3.11.2
, pandas 2.0.0
pd.to_datetime
, and set the format
parameter, which is the existing format, not the desired format.
'%Y-%d-%m'
→ '%Y-%m-%d'
format
codes can be found at strftime()
and strptime()
Format Codesdf['BusinessDate'] = pd.to_datetime(df['BusinessDate'].astype(str), format='%Y-%d-%m').dt.date
.read_parquet
interprets a parquet date
filed as a datetime
(and adds a time component), use the .dt
accessor to extract only the date
component, and assign it back to the column.df['BusinessDate'] = ['BusinessDate'].dt.date
import pandas as pd
# test data
data = pd.date_range('2023-04-01', freq='20s', periods=50000)
df = pd.DataFrame({'test': data})
# display(df.head())
test
0 2023-04-01 00:00:00
1 2023-04-01 00:00:20
2 2023-04-01 00:00:40
3 2023-04-01 00:01:00
4 2023-04-01 00:01:20
# display(df.tail())
test
49995 2023-04-12 13:45:00
49996 2023-04-12 13:45:20
49997 2023-04-12 13:45:40
49998 2023-04-12 13:46:00
49999 2023-04-12 13:46:20
# convert to string and set the current format
df['test'] = pd.to_datetime(df['test'].astype(str), format='%Y-%d-%m %H:%M:%S').dt.date
# display(df.head())
test
0 2023-01-04
1 2023-01-04
2 2023-01-04
3 2023-01-04
4 2023-01-04
# display(df.tail())
test
49995 2023-12-04
49996 2023-12-04
49997 2023-12-04
49998 2023-12-04
49999 2023-12-04
%timeit
comparisondata = pd.date_range('2023-04-01', freq='20s', periods=50000)
df = pd.DataFrame({'test': data})
# comparison
%timeit pd.to_datetime(df['test'].astype(str), format='%Y-%d-%m %H:%M:%S').dt.date # result is a datetime.date Dtype
%timeit df['test'].apply(lambda x: x.strftime('%Y-%d-%m')) # result is a str
%timeit df['test'].dt.strftime('%Y-%d-%m') # result is a str
162 ms ± 1.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
175 ms ± 1.83 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
185 ms ± 1.73 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Upvotes: 1
Reputation: 139
Great question. Pandas doesn't have a feature for this yet.
Once you read the parquet, I recommend using your lambda function like so:
df['new_col'] = df['col'].apply(lambda x: datetime.strptime(x, '%Y-%m-%d'))
Upvotes: 3