Reputation: 488
I have large, nested, terabyte-sized jsonl files which I am converting to parquet files and writing to a partitioned Google Cloud Storage bucket.
The issue is as follows: one of the nested fields is a list of strings. Ideally, the schema I expect for this field is billing_code_modifier: list<item: string>, but there is a rare case where the length of the list is 0 for all records, in which case pandas writes billing_code_modifier: list<item: null>.
This causes an issue because the third-party tool [BigQuery] being used to read these parquet files fails due to the inconsistent schema, expecting list<string> but finding list<null> [it defaults empty arrays to int32, blame Google not me].
How does one get around this? Is there a way to specify the schema while writing parquet files? Since I am dealing with a bucket, I cannot write an empty parquet file with the correct schema and then add the data in a second write operation, as GCP does not allow you to modify files, only overwrite them.
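A minimal repro of the inference fallback, as a sketch (the column name comes from my schema; the data is made up):

```python
import pandas as pd
import pyarrow as pa

# When every list in the column is empty, Arrow has no element values
# to infer a type from, so it falls back to the null type.
df = pd.DataFrame({"billing_code_modifier": [[], [], []]})
table = pa.Table.from_pandas(df)
print(table.schema)
# billing_code_modifier: list<item: null>   <- instead of list<item: string>
```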
Upvotes: 0
Views: 1011
Reputation: 1718
For pandas, you can pass an explicit Arrow schema as a kwarg to to_parquet, which should produce the correct schema even for all-empty batches. See Pyarrow apply schema when using pandas to_parquet() for details.
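A sketch of what that looks like (the billing_code field and the bucket path are illustrative, not from your data):

```python
import pandas as pd
import pyarrow as pa

# Explicit Arrow schema: billing_code_modifier is always list<string>,
# even when every list in a given batch of records is empty.
schema = pa.schema([
    pa.field("billing_code", pa.string()),               # illustrative field
    pa.field("billing_code_modifier", pa.list_(pa.string())),
])

df = pd.DataFrame({
    "billing_code": ["A001", "A002"],
    "billing_code_modifier": [[], []],  # would otherwise infer list<null>
})

# With the pyarrow engine, the schema kwarg is forwarded to
# pyarrow.Table.from_pandas, overriding type inference.
df.to_parquet(
    "gs://your-bucket/part-0000.parquet",  # hypothetical path
    engine="pyarrow",
    schema=schema,
)
```

Since the schema is fixed up front, every partition file comes out with the same types, so BigQuery no longer sees conflicting schemas across files.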
Upvotes: 1