Reputation: 79
When I execute the code below, I get the following error: ValueError: Table schema does not match schema used to create file.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
fields = [
    pa.field('one', pa.int64()),
    pa.field('two', pa.string(), nullable=False),
    pa.field('three', pa.bool_())
]
schema = pa.schema(fields)
schema = schema.remove_metadata()
df = pd.DataFrame(
    {
        'one': [2, 2, 2],
        'two': ['foo', 'bar', 'baz'],
        'three': [True, False, True]
    }
)
df['two'] = df['two'].astype(str)
table = pa.Table.from_pandas(df, schema, preserve_index=False).replace_schema_metadata()
writer = pq.ParquetWriter('parquest_user_defined_schema.parquet', schema=schema)
writer.write_table(table)
Upvotes: 6
Views: 11324
Reputation: 139162
This works fine with the latest version of pyarrow (>= 0.14.0), but I can confirm I also get an error with pyarrow 0.13.
The reason is a bug where the nullability of the fields was not preserved in the conversion from pandas to arrow (see https://issues.apache.org/jira/browse/ARROW-5169).
With pyarrow 0.13:
>>> schema.field_by_name('two').nullable
False
>>> table.schema.field_by_name('two').nullable
True
which meant that your specified schema and the schema of the table passed to write_table did not match, giving the error you see. This is fixed in 0.14, where both give False in the output above.
So you can either remove the nullable=False when creating the schema manually, or update to pyarrow >= 0.14.
Note that if you are writing a single table to a single parquet file, you don't need to specify the schema manually (you already specified it when converting the pandas DataFrame to an arrow Table, and pyarrow will use the schema of the table to write to parquet). So in the simple case, you could also do:
pq.write_table(table, 'parquest_user_defined_schema.parquet')
Additional note: you need a writer.close() to make your example complete.
Upvotes: 6