Yun Ling

Reputation: 113

Save date column with NAT(null) from pandas to parquet

I need to read nullable date values stored as integer-style strings ('YYYYMMDD') into pandas and then save the DataFrame to Parquet with the column typed as Date32[Day], so that the Athena Glue Crawler classifier recognizes the column as a date. The code below fails when writing the column to Parquet:

import pandas as pd

dates = [None, "20200710", "20200711", "20200712"]
data_df = pd.DataFrame(dates, columns=['date'])
data_df['date'] = pd.to_datetime(data_df['date']).dt.date
data_df.to_parquet(r'my_path', engine='pyarrow')

I receive the following error:

Traceback (most recent call last):
  File "", line 123, in convert_column
    result = pa.array(col, type=type_, from_pandas=True, safe=safe)
  File "pyarrow\array.pxi", line 265, in pyarrow.lib.array
  File "pyarrow\array.pxi", line 80, in pyarrow.lib._ndarray_to_array
TypeError: an integer is required (got type datetime.date)

If I move the None value to the end of the date list, the write succeeds and pyarrow infers the date column as Date32[Day]. My guess is that, because the pandas dtype after dt.date is object and the first value in the column is NaT (not a time), pyarrow cannot infer Date32[Day] from the DataFrame or from a sample value and infers an integer type instead. What is a good way to save this column to Parquet as Date32[Day] without sorting the column values? Thanks.

Upvotes: 4

Views: 5167

Answers (2)

joris

Reputation: 139232

This was a bug that was fixed in pyarrow 1.0 (https://issues.apache.org/jira/browse/ARROW-842 / https://github.com/apache/arrow/pull/7537). The snippet from the question now works fine:

In [2]: dates = [None, "20200710", "20200711", "20200712"]
   ...: data_df = pd.DataFrame(dates, columns=['date'])
   ...: data_df['date'] = pd.to_datetime(data_df['date']).dt.date

In [3]: data_df
Out[3]:
         date
0         NaT
1  2020-07-10
2  2020-07-11
3  2020-07-12

In [4]: data_df.to_parquet(r'my_path', engine='pyarrow')

In [5]: import pyarrow.parquet as pq

In [6]: pq.read_table(r'my_path')
Out[6]: 
pyarrow.Table
date: date32[day]

Upvotes: 1

Abhay

Reputation: 615

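You are right: the leading NaT is what causes the problem. One workaround is to replace it before writing. The code below formats the dates as strings and uses an empty string for missing values, so note that the column is then written as a string column rather than Date32[Day].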

import pandas as pd

dates = [None, "20200710", "20200711", "20200712"]
data_df = pd.DataFrame(dates, columns=['date'])
data_df['date'] = pd.to_datetime(data_df['date']).dt.date

# In addition, add this line to replace NaT with an empty string (dates become strings)
# Change the strftime format as needed (here: YYYY-MM-DD)
data_df['date'] = [d.strftime('%Y-%m-%d') if not pd.isnull(d) else '' for d in data_df['date']]

data_df.to_parquet(r'my_path', engine='pyarrow')

I hope this works for you and the error is solved.

Upvotes: 1
