Sachin Jain

Reputation: 79

How to write Parquet with user defined schema through pyarrow

When I run the code below, I get the following error: ValueError: Table schema does not match schema used to create file.

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


fields = [
    ('one', pa.int64()),
    ('two', pa.string(), False),
    ('three', pa.bool_())
]
schema = pa.schema(fields)

schema = schema.remove_metadata()
df = pd.DataFrame(
    {
        'one': [2, 2, 2],
        'two': ['foo', 'bar', 'baz'],
        'three': [True, False, True]
    }
)

df['two'] = df['two'].astype(str)

table = pa.Table.from_pandas(df, schema, preserve_index=False).replace_schema_metadata()
writer = pq.ParquetWriter('parquest_user_defined_schema.parquet', schema=schema)
writer.write_table(table)

Upvotes: 6

Views: 11324

Answers (1)

joris

Reputation: 139162

This works fine with the latest version of pyarrow (>= 0.14.0), but I can confirm that I also get the error with pyarrow 0.13.

The cause was a bug: the nullability declared in the schema was not preserved in the conversion from pandas to Arrow (see https://issues.apache.org/jira/browse/ARROW-5169).

With pyarrow 0.13:

>>> schema.field_by_name('two').nullable
False

>>> table.schema.field_by_name('two').nullable
True

As a result, the schema you specified and the schema of the table passed to write_table did not match, which gives the error you see.
This is fixed in 0.14, where both expressions above return False.

So you can either remove the nullable=False when creating the schema manually, or update to pyarrow >= 0.14.


Note that if you are writing a single table to a single parquet file, you don't need to specify the schema manually (you already specified it when converting the pandas DataFrame to an arrow Table, and pyarrow will use the schema of the table to write to parquet). So in the simple case, you could also do:

pq.write_table(table, 'parquest_user_defined_schema.parquet')

Additional note: you need to call writer.close() to make your example complete; the Parquet footer is only written when the writer is closed.

Upvotes: 6
