Siddharth Chabra

Reputation: 488

BigQuery treats parquet list&lt;string&gt; as list&lt;int32&gt; when an empty array is passed

I have large, nested, terabyte-sized JSONL files which I am converting to Parquet files and writing to a partitioned Google Cloud Storage bucket.

The issue is as follows. One of the nested fields is a list of strings; ideally, the schema I expect for this field is billing_code_modifier: list&lt;item: string&gt;. However, in rare cases the list is empty for all records, and pandas then writes the field as billing_code_modifier: list&lt;item: null&gt;.

This causes a problem because the third-party tool reading these Parquet files [BigQuery] fails on the inconsistent schema, expecting list&lt;string&gt; but finding list&lt;null&gt; [it defaults empty arrays to int32, blame Google not me].

How does one get around this? Is there a way to specify the schema while writing Parquet files? Since I am dealing with a bucket, I cannot write an empty Parquet file and then add the data to it in two separate write operations, as GCS does not allow you to modify files, only overwrite them.

Upvotes: 0

Views: 1011

Answers (1)

Micah Kornfield

Reputation: 1718

For pandas you can specify an Arrow schema as a kwarg, which should produce the correct schema. See Pyarrow apply schema when using pandas to_parquet() for details.

Upvotes: 1
