Reputation: 606
I have a parquet file with a struct field in a ListArray column, where the data type of a field within the struct changed from int to float with some new data.
To combine the new and old data, I had been reading the active and historical parquet files with pq.read_table
and then using pa.concat_tables
to combine them and write the new file.
To make the schemas of the two tables compatible before concatenating, I do the following:
import pyarrow as pa
import pyarrow.parquet as pq

active = pq.read_table(r"path\to\active\parquet")
active_schema = active.schema
hist = pq.read_table(r"path\to\hist\parquet")
hist = hist.cast(target_schema=active_schema)
combined = pa.concat_tables([active, hist])
But I get the following error when casting:
ArrowNotImplementedError: Unsupported cast from struct<code: string, unit_price: struct<amount: int64, currency: string>, line_total: struct<amount: int64, currency: string>, reversal: bool, include_for: list<item: string>, quantity: int64, seats: int64, units: int64, percentage: int64> to struct using function cast_struct
Based on this, it seems I won't be able to do the cast directly.
So my question is: how can I go about merging these datasets, or how can I update the schema of the old table? I'm trying to stay within the Arrow/Parquet ecosystem if possible.
Upvotes: 0
Views: 4552
Reputation: 139162
Unfortunately, casting a struct to a similar struct type with a different field type is not yet implemented (see https://issues.apache.org/jira/browse/ARROW-1888 for the feature request).
I think the only possible workaround currently is to extract the struct column, cast its fields separately, recreate the struct column from the cast fields, and update the table with the result.
A small example of this workflow, starting from the following table with the struct column:
>>> table = pa.table({'col1': [1, 2, 3], 'col2': [{'a': 1, 'b': 2}, None, {'a':3, 'b':4}]})
>>> table
pyarrow.Table
col1: int64
col2: struct<a: int64, b: int64>
child 0, a: int64
child 1, b: int64
and assume the following target schema (where one field of the struct column is changed from int to float):
>>> new_schema = pa.schema([('col1', pa.int64()), ('col2', pa.struct([('a', pa.int64()), ('b', pa.float64())]))])
>>> new_schema
col1: int64
col2: struct<a: int64, b: double>
child 0, a: int64
child 1, b: double
Then the workaround looks like:
# cast fields separately
struct_col = table["col2"]
new_struct_type = new_schema.field("col2").type
new_fields = [
    field.cast(typ_field.type)
    for field, typ_field in zip(struct_col.flatten(), new_struct_type)
]
# create new structarray from separate fields
import pyarrow.compute as pc
new_struct_array = pc.make_struct(*new_fields, field_names=[f.name for f in new_struct_type])
# replace the table column with the new array
col_idx = table.schema.get_field_index("col2")
new_table = table.set_column(col_idx, new_schema.field("col2"), new_struct_array)
>>> new_table
pyarrow.Table
col1: int64
col2: struct<a: int64, b: double>
child 0, a: int64
child 1, b: double
Upvotes: 2