Reputation: 5862
I’m trying to read a CSV into Pandas, and then write it to Parquet. The challenge is that the CSV has a date column with a value of 3000-12-31, and apparently Pandas has no way to store that value as an actual date. Because of that, PyArrow fails to read the date value.
An example file and the code to reproduce the problem:
test.csv
t
3000-12-31
import pandas as pd
import pyarrow as pa
df = pd.read_csv("test.csv", parse_dates=["t"])
schema = pa.schema([pa.field("t", pa.date64())])
table = pa.Table.from_pandas(df, schema=schema)
This gives a (somewhat unhelpful) error:
TypeError: an integer is required (got type str)
What's the right way to do this?
Upvotes: 0
Views: 802
Reputation: 139232
Pandas datetime columns (which use the datetime64[ns] data type) indeed cannot store such dates: at nanosecond resolution in a 64-bit integer, the representable range ends in the year 2262.
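To see where that limit comes from, here is a minimal sketch using only the standard library. It assumes the usual description of datetime64[ns] as a signed 64-bit count of nanoseconds since 1970-01-01; the names `INT64_MAX`, `upper_bound` and `target` are just illustrative.

```python
import datetime

# datetime64[ns] is assumed here to be a signed 64-bit count of
# nanoseconds since the Unix epoch (1970-01-01).
INT64_MAX = 2**63 - 1
epoch = datetime.datetime(1970, 1, 1)

# datetime.timedelta only has microsecond resolution, so the last
# three (nanosecond) digits are dropped by the integer division.
upper_bound = epoch + datetime.timedelta(microseconds=INT64_MAX // 1000)
print(upper_bound)  # 2262-04-11 23:47:16.854775

# The date from the question lies well beyond that bound.
target = datetime.date(3000, 12, 31)
print(target > upper_bound.date())  # True
```

This should roughly match what `pd.Timestamp.max` reports, minus the nanosecond digits.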
One possible workaround is to convert the strings to datetime.datetime objects in an object dtype column. pyarrow should then be able to accept them to create a date column.
This conversion could, for example, be done with dateutil:
>>> import dateutil.parser
>>> df['t'] = df['t'].apply(dateutil.parser.parse)
>>> df
t
0 3000-12-31 00:00:00
>>> table = pa.Table.from_pandas(df, schema=schema)
>>> table
pyarrow.Table
t: date64[ms]
or, if you use a fixed format, using datetime.datetime.strptime is probably more reliable:
>>> import datetime
>>> df['t'] = df['t'].apply(lambda s: datetime.datetime.strptime(s, "%Y-%m-%d"))
>>> table = pa.Table.from_pandas(df, schema=schema)
>>> table
pyarrow.Table
t: date64[ms]
Upvotes: 1