Reputation: 348
I am trying to load data from a CSV into a Parquet file using pyarrow. I am using the convert options to set the data types to their proper types, and the timestamp_parsers option to dictate how the timestamp data should be interpreted. Please see my CSV below:
time,data
01-11-19 10:11:56.132,xxx
Please see my code sample below.
import pyarrow as pa
from pyarrow import csv
from pyarrow import parquet
convert_dict = {
    'time': pa.timestamp('us', None),
    'data': pa.string()
}
convert_options = csv.ConvertOptions(
    column_types=convert_dict,
    strings_can_be_null=True,
    quoted_strings_can_be_null=True,
    timestamp_parsers=['%d-%m-%y %H:%M:%S.%f']
)
table = csv.read_csv('test.csv', convert_options=convert_options)
print(table)
parquet.write_table(table, 'test.parquet')
Basically, pyarrow doesn't like some strptime format codes. Specifically in this case, it does not like "%f", which is for fractional seconds (https://www.geeksforgeeks.org/python-datetime-strptime-function/). Any help getting pyarrow to do what I need would be appreciated.
Just to be clear, I can get the code to run if I edit the data to not have fractional seconds and then remove the "%f" from the timestamp_parsers option. However, I need to maintain the integrity of the data, so this is not an option. To me it seems like either a bug in pyarrow, or I'm an idiot and missing something obvious. Open to both options; I just want to know which it is.
Upvotes: 4
Views: 4111
Reputation: 4514
A workaround, based on the pyarrow issues discussion:
import pyarrow as pa
import pyarrow.compute as pc

# Format of the seconds chunk, e.g. "2024-04-15 12:30:45" (assumed definition;
# the linked issue uses a named constant for this).
iso_8601_no_t_no_milli_no_z_str = "%Y-%m-%d %H:%M:%S"

def parse_timestamp_us(
    df: pa.Table,
) -> pa.Table:
    """
    Hack to combine timestamp seconds + microseconds without the "."
    separating the two timestamp chunks, because pyarrow doesn't
    yet have a clean solution. Based on
    https://github.com/apache/arrow/issues/41132#issuecomment-2052555361
    """
    last_second_codeunit_idx = 19
    timestamp_seconds = pc.strptime(
        pc.utf8_slice_codeunits(df["timestamp"], 0, last_second_codeunit_idx),
        format=iso_8601_no_t_no_milli_no_z_str,
        unit="s",
    )
    timestamp_microseconds_only_str = pc.utf8_slice_codeunits(
        df["timestamp"], last_second_codeunit_idx + 1, 99
    )
    # Convert empty string (no fractional part) to 0
    timestamp_microseconds_only_str = pc.if_else(
        pc.equal(timestamp_microseconds_only_str, ""),
        "0",
        timestamp_microseconds_only_str,
    )
    timestamp_microseconds_only = timestamp_microseconds_only_str.cast(pa.int64()).cast(
        pa.duration("us")
    )
    timestamp_microseconds = pc.add(timestamp_seconds, timestamp_microseconds_only)
    # pyarrow doesn't handle chaining properly if filters are chained after
    # drop/append_column operations.
    return df.drop(["timestamp"]).append_column("timestamp", timestamp_microseconds)
Upvotes: 0
Reputation: 159
The parsing issue mentioned by @Rok is unfortunate. Here is another workaround to convert ISO 8601-formatted strings to pa.timestamp. It is not efficient, because it goes from Arrow -> Python list -> Arrow (tested on Arrow version 13.0.0):
from datetime import datetime, timezone
import pyarrow as pa
def pyarrow_string_to_ts(col):
    time_string_list = col.to_pylist()
    timestamps = [datetime.fromisoformat(x).astimezone(timezone.utc) for x in time_string_list]
    return pa.array(timestamps)
sample_data = [{'price': '10.2', 'date': '2023-09-25T00:00:00.000000-04:00'}, {'price': '9.7', 'date': '2023-09-26T00:00:00.000000-05:00'}]
table = pa.Table.from_pylist(sample_data)
table = table.set_column(table.schema.get_field_index("date"), "date", pyarrow_string_to_ts(table.column("date")))
Upvotes: 0
Reputation: 348
So I have found that for timestamp data, you should just try to have the data in the default parser format (ISO 8601). For example, if you convert CSV data into Parquet using the pyarrow timestamp data type, just have the CSV data in this format:
No time zone
YYYY-MM-DDTHH:MI:SS.FF6
With time zone
YYYY-MM-DDTHH:MI:SS.FF6TZH:TZM
Upvotes: 0
Reputation: 416
%f is not supported in pyarrow and most likely won't be, as it's a Python-specific flag. See the discussion here: https://issues.apache.org/jira/browse/ARROW-15883 . PRs are of course always welcome!
As a workaround, you could first read the timestamps as strings, then process them by slicing off the fractional part and adding it as a pa.duration to the parsed timestamps:
import pyarrow as pa
import pyarrow.compute as pc
ts = pa.array(["1970-01-01T00:00:59.123456789", "2000-02-29T23:23:23.999999999"], pa.string())
ts2 = pc.strptime(pc.utf8_slice_codeunits(ts, 0, 19), format="%Y-%m-%dT%H:%M:%S", unit="ns")
d = pc.utf8_slice_codeunits(ts, 20, 99).cast(pa.int64()).cast(pa.duration("ns"))
pc.add(ts2, d)
Upvotes: 2