J_Py
J_Py

Reputation: 61

pandas.read_parquet incorrectly interprets the date field

I have a parquet file with a date field in it called 'BusinessDate'. When I import it to a dataframe, it automatically determines the field BusinessDate being a date (datetime64[ns, UTC]).

However, since the format of this 'BusinessDate' field is 'YYYY-MM-DD', some of these dates were incorrectly imported. E.g. 2013-02-01 should be the first of February 2013 while it actually was interpreted as the second of January 2013.

Is it possible to set the correct format of the 'BusinessDate' field when importing the parquet file?

Initially I used:

df.read_parquet('data.parquet')

If I had a csv file my solution would be:

custom_date_parser = lambda x: datetime.strptime(x, '%Y-%m-%d')
df.read_csv('data.csv',parse_dates=['BusinessDate'], date_parser=custom_date_parser)

However, when I try a comparable code to try to fix the date issue, it gives an error:

custom_date_parser = lambda x: datetime.strptime(x, '%Y-%m-%d')
df.read_parquet('data.parquet',parse_dates=['BusinessDate'], date_parser=custom_date_parser)

The error is caused due to the fact that the read_parquet function doesn't have a parse_dates or date_parser attribute, read_csv function does have.

So my question is: How do I import a parquet file in pandas so that the 'BusinessDate' field is imported correctly as a date in the correct format which is 'YYYY-MM-DD' in my case. Or in case this not possible with pandas read_parquet function, is it possible to import the 'BusinessDate' field as a string field in a pandas.DataFrame so that I can change it afterwards?

Upvotes: 5

Views: 5503

Answers (2)

Trenton McKinney
Trenton McKinney

Reputation: 62493

df['BusinessDate'] = pd.to_datetime(df['BusinessDate'].astype(str), format='%Y-%d-%m').dt.date
  • If .read_parquet interprets a parquet date filed as a datetime (and adds a time component), use the .dt accessor to extract only the date component, and assign it back to the column.
df['BusinessDate'] = ['BusinessDate'].dt.date

import pandas as pd

# test data
data = pd.date_range('2023-04-01', freq='20s', periods=50000)
df = pd.DataFrame({'test': data})

# display(df.head())
                 test
0 2023-04-01 00:00:00
1 2023-04-01 00:00:20
2 2023-04-01 00:00:40
3 2023-04-01 00:01:00
4 2023-04-01 00:01:20

# display(df.tail())
                     test
49995 2023-04-12 13:45:00
49996 2023-04-12 13:45:20
49997 2023-04-12 13:45:40
49998 2023-04-12 13:46:00
49999 2023-04-12 13:46:20

# convert to string and set the current format
df['test'] = pd.to_datetime(df['test'].astype(str), format='%Y-%d-%m %H:%M:%S').dt.date

# display(df.head())
         test
0  2023-01-04
1  2023-01-04
2  2023-01-04
3  2023-01-04
4  2023-01-04

# display(df.tail())
             test
49995  2023-12-04
49996  2023-12-04
49997  2023-12-04
49998  2023-12-04
49999  2023-12-04

%timeit comparison

data = pd.date_range('2023-04-01', freq='20s', periods=50000)
df = pd.DataFrame({'test': data})

# comparison
%timeit pd.to_datetime(df['test'].astype(str), format='%Y-%d-%m %H:%M:%S').dt.date  # result is a datetime.date Dtype
%timeit df['test'].apply(lambda x: x.strftime('%Y-%d-%m'))  # result is a str
%timeit df['test'].dt.strftime('%Y-%d-%m')  # result is a str

output

162 ms ± 1.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
175 ms ± 1.83 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
185 ms ± 1.73 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Upvotes: 1

KevinG
KevinG

Reputation: 139

Great question. Pandas doesn't have a feature for this yet.

Once you read the parquet, I recommend using your lambda function like so:

df['new_col'] = df['col'].apply(lambda x: datetime.strptime(x, '%Y-%m-%d'))

Upvotes: 3

Related Questions