Reputation: 488
I have large, nested, terabyte-sized jsonl files which I am converting to parquet files and writing to a partitioned Google Cloud Storage bucket.
The issue is as follows: one of the nested fields is a list of strings. Ideally, the schema I expect for this field is billing_code_modifier: list<item: string>, but there is a rare case where the length of the list is 0 for all records, in which case pandas writes billing_code_modifier: list<item: null>.
This causes an issue because the third-party tool [BigQuery] being used to read these parquet files fails due to the inconsistent schema, expecting list<string> but finding list<null> [it defaults empty arrays to int32, blame Google not me].
How does one get around this? Is there a way to specify the schema while writing parquet files? Since I am dealing with a bucket, I cannot write an empty parquet file with the correct schema and then add the data in a second write operation, as GCP does not allow you to modify files, only overwrite them.
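A minimal repro of the inference fallback, as a sketch (the column name comes from my schema; the data is made up):

```python
import pandas as pd
import pyarrow as pa

# When every list in the column is empty, Arrow has no element values
# to infer a type from, so it falls back to the null type.
df = pd.DataFrame({"billing_code_modifier": [[], [], []]})
table = pa.Table.from_pandas(df)
print(table.schema)
# billing_code_modifier: list<item: null>   <- instead of list<item: string>
```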
Upvotes: 0
Views: 1011
Reputation: 1718
For pandas, you can pass an explicit Arrow schema as a kwarg to to_parquet, which should produce the correct schema even for all-empty batches. See Pyarrow apply schema when using pandas to_parquet() for details.
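A sketch of what that looks like (the billing_code field and the bucket path are illustrative, not from your data):

```python
import pandas as pd
import pyarrow as pa

# Explicit Arrow schema: billing_code_modifier is always list<string>,
# even when every list in a given batch of records is empty.
schema = pa.schema([
    pa.field("billing_code", pa.string()),               # illustrative field
    pa.field("billing_code_modifier", pa.list_(pa.string())),
])

df = pd.DataFrame({
    "billing_code": ["A001", "A002"],
    "billing_code_modifier": [[], []],  # would otherwise infer list<null>
})

# With the pyarrow engine, the schema kwarg is forwarded to
# pyarrow.Table.from_pandas, overriding type inference.
df.to_parquet(
    "gs://your-bucket/part-0000.parquet",  # hypothetical path
    engine="pyarrow",
    schema=schema,
)
```

Since the schema is fixed up front, every partition file comes out with the same types, so BigQuery no longer sees conflicting schemas across files.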
Upvotes: 1