not_a_comp_scientist

Reputation: 348

PyArrow issue with timestamp data

I am trying to load data from a CSV into a Parquet file using pyarrow. I am using ConvertOptions to set the column data types, and the timestamp_parsers option to dictate how the timestamp data should be interpreted. Please see my CSV below:

time,data
01-11-19 10:11:56.132,xxx

Please see my code sample below.

import pyarrow as pa
from pyarrow import csv
from pyarrow import parquet


convert_dict = {
    'time': pa.timestamp('us', None),
    'data': pa.string()
}

convert_options = csv.ConvertOptions(
    column_types=convert_dict,
    strings_can_be_null=True,
    quoted_strings_can_be_null=True,
    timestamp_parsers=['%d-%m-%y %H:%M:%S.%f']
)

# Fails here: pyarrow's strptime does not recognize the %f directive
table = csv.read_csv('test.csv', convert_options=convert_options)
print(table)
parquet.write_table(table, 'test.parquet')

Basically, pyarrow doesn't accept some strptime directives. Specifically, in this case it rejects "%f", which is for fractional seconds (https://www.geeksforgeeks.org/python-datetime-strptime-function/). Any help getting pyarrow to do what I need would be appreciated.

Just to be clear, I can get the code to run if I edit the data to remove the fractional seconds and drop "%f" from the timestamp_parsers option. However, I need to maintain the integrity of the data, so this is not an option. To me it seems like either this is a bug in pyarrow, or I'm an idiot and missing something obvious. Open to both options; I just want to know which it is.

Upvotes: 4

Views: 4111

Answers (4)

skeller88

Reputation: 4514

Workaround based on a pyarrow issues discussion (iso_8601_no_t_no_milli_no_z_str is assumed here to be a space-separated, seconds-resolution ISO 8601 strptime format):

import pyarrow as pa
import pyarrow.compute as pc

# Assumed format for the non-fractional part, based on the variable name:
# seconds-resolution ISO 8601 with a space instead of "T" and no zone.
iso_8601_no_t_no_milli_no_z_str = "%Y-%m-%d %H:%M:%S"


def parse_timestamp_us(
    df: pa.Table,
) -> pa.Table:
    """
    Hack to combine timestamp seconds + microseconds without the "."
    separating the two timestamp chunks, because pyarrow doesn't
    yet have a clean solution. Based on
    https://github.com/apache/arrow/issues/41132#issuecomment-2052555361
    """
    # "YYYY-MM-DD HH:MM:SS" occupies the first 19 codeunits.
    last_second_codeunit_idx = 19
    timestamp_seconds = pc.strptime(
        pc.utf8_slice_codeunits(df["timestamp"], 0, last_second_codeunit_idx),
        format=iso_8601_no_t_no_milli_no_z_str,
        unit="s",
    )
    # Everything after the "." separator is the fractional part.
    timestamp_microseconds_only_str = pc.utf8_slice_codeunits(
        df["timestamp"], last_second_codeunit_idx + 1, 99
    )
    # convert empty string (no fractional part) to 0
    timestamp_microseconds_only_str = pc.if_else(
        pc.equal(timestamp_microseconds_only_str, ""),
        "0",
        timestamp_microseconds_only_str,
    )
    timestamp_microseconds_only = timestamp_microseconds_only_str.cast(pa.int64()).cast(
        pa.duration("us")
    )
    timestamp_microseconds = pc.add(timestamp_seconds, timestamp_microseconds_only)

    # pyarrow doesn't handle chaining properly if filters are chained after
    # drop/append_column operations.
    return df.drop(["timestamp"]).append_column("timestamp", timestamp_microseconds)
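
For example, applied to a table that mixes fractional and whole-second strings (a sketch, assuming the space-separated format above):

df = pa.table({"timestamp": ["2019-11-01 10:11:56.132132", "2019-11-01 10:11:57"]})
print(parse_timestamp_us(df)["timestamp"])
# expected: 2019-11-01 10:11:56.132132 and 2019-11-01 10:11:57.000000 as timestamp[us]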

Upvotes: 0

lyoganathan

Reputation: 159

The parsing issue mentioned by @Rok is unfortunate. Here is another workaround to convert ISO 8601-formatted strings to pa.timestamp. It is not efficient, because it goes from Arrow -> Python list -> Arrow (tested on Arrow version 13.0.0):

from datetime import datetime, timezone
import pyarrow as pa

def pyarrow_string_to_ts(col):
    # Arrow -> Python: pull the strings out, parse them with the standard
    # library, normalize to UTC, then build a new Arrow array.
    time_string_list = col.to_pylist()
    timestamps = [datetime.fromisoformat(x).astimezone(timezone.utc) for x in time_string_list]
    return pa.array(timestamps)

sample_data = [{'price': '10.2', 'date': '2023-09-25T00:00:00.000000-04:00'}, {'price': '9.7', 'date': '2023-09-26T00:00:00.000000-05:00'}]
table = pa.Table.from_pylist(sample_data)

# Replace the string column with the parsed timestamp column in place.
table = table.set_column(table.schema.get_field_index("date"), "date", pyarrow_string_to_ts(table.column("date")))
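
After the swap, table.schema should report date as a timezone-aware timestamp[us] column, since pa.array infers the unit and timezone from the aware datetime objects.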

Upvotes: 0

not_a_comp_scientist

Reputation: 348

So I have found that for timestamp data, you should just try to have the data in the default parser format (ISO 8601). For example, if you convert CSV data into Parquet using the pyarrow timestamp data type, just have the CSV data in this format:

No time zone

YYYY-MM-DDTHH:MI:SS.FF6

With time zone

YYYY-MM-DDTHH:MI:SS.FF6TZH:TZM
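
With the question's data rewritten into that format, the default parser handles it and no timestamp_parsers option is needed at all. A minimal sketch (the file name and contents are illustrative):

import pyarrow as pa
from pyarrow import csv
from pyarrow import parquet

# iso.csv (illustrative):
# time,data
# 2019-11-01T10:11:56.132000,xxx

convert_options = csv.ConvertOptions(
    column_types={'time': pa.timestamp('us'), 'data': pa.string()}
)
table = csv.read_csv('iso.csv', convert_options=convert_options)
parquet.write_table(table, 'iso.parquet')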

Upvotes: 0

Rok
Rok

Reputation: 416

%f is not supported in pyarrow and most likely won't be, as it's a Python-specific flag. See the discussion here: https://issues.apache.org/jira/browse/ARROW-15883. PRs are of course always welcome!

As a workaround you could first read the timestamps as strings, then process them by slicing off the fractional part and adding it back as a pa.duration to the parsed timestamps:

import pyarrow as pa
import pyarrow.compute as pc

ts = pa.array(["1970-01-01T00:00:59.123456789", "2000-02-29T23:23:23.999999999"], pa.string())
# Parse the seconds-resolution prefix (first 19 codeunits).
ts2 = pc.strptime(pc.utf8_slice_codeunits(ts, 0, 19), format="%Y-%m-%dT%H:%M:%S", unit="ns")
# Slice off the fractional digits after the "." and cast them to a duration.
d = pc.utf8_slice_codeunits(ts, 20, 99).cast(pa.int64()).cast(pa.duration("ns"))
pc.add(ts2, d)
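
If the kernels behave as expected, the sum reproduces the original instants as a timestamp[ns] array: 1970-01-01 00:00:59.123456789 and 2000-02-29 23:23:23.999999999.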

Upvotes: 2
