matthewmturner

Reputation: 606

PyArrow Table: Cast a Struct within a ListArray column to a new schema

I have a parquet file with a struct field inside a ListArray column, where the data type of one of the struct's fields changed from int to float in some new data.

In order to combine the new and old data, I had been reading the active & historical parquet files with pq.read_table, then using pa.concat_tables to combine them and write the new file.

So, to make the schemas of the two tables compatible before concatenating, I do the following:

import pyarrow as pa
import pyarrow.parquet as pq

active = pq.read_table(r"path\to\active\parquet")
active_schema = active.schema

hist = pq.read_table(r"path\to\hist\parquet")
hist = hist.cast(target_schema=active_schema)

combined = pa.concat_tables([active, hist])

But I get the following error when casting:

ArrowNotImplementedError: Unsupported cast from struct<code: string, unit_price: struct<amount: int64, currency: string>, line_total: struct<amount: int64, currency: string>, reversal: bool, include_for: list<item: string>, quantity: int64, seats: int64, units: int64, percentage: int64> to struct using function cast_struct

Based on this, it seems I won't be able to do the cast.

So my question is: how can I go about merging these datasets / updating the schema on the old table? I'm trying to stay within the Arrow / Parquet ecosystem if possible.

Upvotes: 0

Views: 4552

Answers (1)

joris

Reputation: 139162

Unfortunately, casting a struct to a similar struct type but with a different field type is not yet implemented (see https://issues.apache.org/jira/browse/ARROW-1888 for the feature request).

I think currently the only possible workaround is to extract the struct column, cast its fields separately, recreate the struct column from those, and update the table with the result.

A small example of this workflow, starting from the following table with the struct column:

>>> table = pa.table({'col1': [1, 2, 3], 'col2': [{'a': 1, 'b': 2}, None, {'a':3, 'b':4}]})
>>> table
pyarrow.Table
col1: int64
col2: struct<a: int64, b: int64>
  child 0, a: int64
  child 1, b: int64

and assume the following target schema (where one field of the struct column is changed from int to float):

>>> new_schema = pa.schema([('col1', pa.int64()), ('col2', pa.struct([('a', pa.int64()), ('b', pa.float64())]))])
>>> new_schema
col1: int64
col2: struct<a: int64, b: double>
  child 0, a: int64
  child 1, b: double

Then the workaround looks like:

# cast fields separately
struct_col = table["col2"]
new_struct_type = new_schema.field("col2").type
new_fields = [field.cast(typ_field.type) for field, typ_field in zip(struct_col.flatten(), new_struct_type)]
# create new structarray from separate fields
import pyarrow.compute as pc
new_struct_array = pc.make_struct(*new_fields, field_names=[f.name for f in new_struct_type])
# replace the table column with the new array
col_idx = table.schema.get_field_index("col2")
new_table = table.set_column(col_idx, new_schema.field("col2"), new_struct_array)

>>> new_table
pyarrow.Table
col1: int64
col2: struct<a: int64, b: double>
  child 0, a: int64
  child 1, b: double

Upvotes: 2
