Reputation: 113
I need to read nullable date values stored in integer format ('YYYYMMDD') into pandas and then save the dataframe to Parquet with the column typed as Date32[Day], so that the Athena Glue Crawler classifier recognizes that column as a date. The code below fails when writing the column to Parquet from pandas:
import pandas as pd
dates = [None, "20200710", "20200711", "20200712"]
data_df = pd.DataFrame(dates, columns=['date'])
data_df['date'] = pd.to_datetime(data_df['date']).dt.date
data_df.to_parquet(r'my_path', engine='pyarrow')
I receive the error below:
Traceback (most recent call last):
File "", line 123, in convert_column
result = pa.array(col, type=type_, from_pandas=True, safe=safe)
File "pyarrow\array.pxi", line 265, in pyarrow.lib.array
File "pyarrow\array.pxi", line 80, in pyarrow.lib._ndarray_to_array
TypeError: an integer is required (got type datetime.date)
If I move the None value towards the end of the date list, this works without any issue and pyarrow infers the date column as Date32[Day]. My guess is that, since the pandas column type for dt.date is object and the first value of the column is NaT (not a time), pyarrow is not able to infer Date32[Day] from the dataframe or from a sample value, and infers the column as Integer instead. What is a good way to save this dataframe column to Parquet as a Date32[Day] column without sorting the column values? Thanks.
Upvotes: 4
Views: 5167
Reputation: 139232
This was a bug which is fixed in pyarrow 1.0 (https://issues.apache.org/jira/browse/ARROW-842 / https://github.com/apache/arrow/pull/7537). The snippet from above now works fine:
In [2]: dates = [None, "20200710", "20200711", "20200712"]
...: data_df = pd.DataFrame(dates, columns=['date'])
...: data_df['date'] = pd.to_datetime(data_df['date']).dt.date
In [3]: data_df
Out[3]:
         date
0         NaT
1  2020-07-10
2  2020-07-11
3  2020-07-12
In [4]: data_df.to_parquet(r'my_path', engine='pyarrow')
In [5]: import pyarrow.parquet as pq
In [6]: pq.read_table(r'my_path')
Out[6]:
pyarrow.Table
date: date32[day]
Upvotes: 1
Reputation: 615
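You are right. Because the first value is NaT, one workaround is to replace it before writing. The code below formats the dates as strings and replaces NaT with an empty string; note that the column is then written to Parquet as a string column rather than Date32[Day].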
import pandas as pd
dates = [None, "20200710", "20200711", "20200712"]
data_df = pd.DataFrame(dates, columns=['date'])
data_df['date'] = pd.to_datetime(data_df['date']).dt.date
# Additional line: format each date as a string and replace NaT with ''
# Adjust the strftime format as needed (YYYY-MM-DD is used here)
data_df['date'] = [d.strftime('%Y-%m-%d') if not pd.isnull(d) else '' for d in data_df['date']]
data_df.to_parquet(r'my_path', engine='pyarrow')
I hope this works for you and the error is solved.
Upvotes: 1