Parse CSV with far future dates to Parquet

Question

I’m trying to read a CSV into Pandas, and then write it to Parquet. The challenge is that the CSV has a date column with a value of 3000-12-31, and apparently Pandas has no way to store that value as an actual date. Because of that, PyArrow fails to read the date value.

An example file and the code to reproduce the problem:

test.csv

t
3000-12-31

import pandas as pd
import pyarrow as pa
df = pd.read_csv("test.csv", parse_dates=["t"])
schema = pa.schema([pa.field("t", pa.date64())])
table = pa.Table.from_pandas(df, schema=schema)

This gives a somewhat unhelpful error:

TypeError: an integer is required (got type str)

What's the right way to do this?

joris · Accepted Answer

Pandas datetime columns (which use the datetime64[ns] data type) indeed cannot store such dates: at nanosecond resolution the representable range runs only from roughly the year 1677 to 2262.
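
To see where that limit comes from, here is a small standard-library sketch. It assumes the usual layout of datetime64[ns], a signed 64-bit count of nanoseconds since 1970-01-01, and computes the largest instant that fits. The names INT64_MAX and upper_bound are only for illustration.

```python
import datetime

# datetime64[ns] stores a signed 64-bit count of nanoseconds since 1970-01-01.
INT64_MAX = 2**63 - 1

# Python's datetime has microsecond resolution, so drop the last three digits.
epoch = datetime.datetime(1970, 1, 1)
upper_bound = epoch + datetime.timedelta(microseconds=INT64_MAX // 1000)

print(upper_bound)                                    # 2262-04-11 23:47:16.854775
print(datetime.datetime(3000, 12, 31) > upper_bound)  # True
```

So 3000-12-31 lies well past the last representable nanosecond timestamp, while a plain datetime.datetime object (which supports years up to 9999) holds it without trouble.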

One possible workaround is to convert the strings to datetime.datetime objects in an object dtype column; pyarrow should then be able to accept them to create a date column. This conversion could, e.g., be done with dateutil:

>>> import dateutil.parser
>>> df['t'] = df['t'].apply(dateutil.parser.parse)
>>> df
                     t
0  3000-12-31 00:00:00

>>> table = pa.Table.from_pandas(df, schema=schema)
>>> table
pyarrow.Table
t: date64[ms]

or, if you use a fixed format, using datetime.datetime.strptime is probably more reliable:

>>> import datetime
>>> df['t'] = df['t'].apply(lambda s: datetime.datetime.strptime(s, "%Y-%m-%d"))
>>> table = pa.Table.from_pandas(df, schema=schema)
>>> table
pyarrow.Table
t: date64[ms]
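
As a rough illustration of why a fixed format tends to be more predictable: strptime rejects anything that does not match the format string, rather than guessing. The helper name parse_fixed below is hypothetical, and returning None for bad input is just one possible choice for this sketch.

```python
import datetime


def parse_fixed(value, fmt="%Y-%m-%d"):
    """Parse a date string that must match fmt exactly; return None otherwise."""
    try:
        return datetime.datetime.strptime(value, fmt)
    except ValueError:
        # strptime raises ValueError on any mismatch instead of guessing.
        return None


print(parse_fixed("3000-12-31"))  # 3000-12-31 00:00:00
print(parse_fixed("31/12/3000"))  # None
```

In a real pipeline you might prefer to let the ValueError propagate so that malformed rows are noticed instead of silently becoming missing values.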
