Reputation: 10213
We have the following three JSONs with data that should be loaded into the same table:
{ "name": "tom", "customValues": [] }
{ "name": "joe", "customValues": { "member": "1" } }
{ "name": "joe", "customValues": { "year": "2020", "number": "3" } }
We load the data with the Python client, using bigquery.LoadJobConfig:
job_config = bigquery.LoadJobConfig(
    schema=SCHEMA_MAP.get(bq_table) if autodetect == False else None,
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE if remove_old == True else bigquery.WriteDisposition.WRITE_APPEND,
    autodetect=autodetect
)
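For reference, this config is then passed to a load call roughly like this (the client setup, file name, and table id are simplified placeholders for this question):

client = bigquery.Client()
with open("data.ndjson", "rb") as source_file:
    load_job = client.load_table_from_file(
        source_file,
        "my-project.my-dataset.my-table",  # placeholder table id
        job_config=job_config,
    )
load_job.result()  # wait for the load to complete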
SCHEMA_MAP is a dictionary of arrays, where each array is the schema for one of our tables. We define our BigQuery schemas in Python using bigquery.SchemaField. If each of the 3 JSONs above were going into 3 different tables, I would have their table schemas defined as:
SCHEMA_T1 = [
    bigquery.SchemaField("name", "STRING"),
    bigquery.SchemaField("customValues", "STRING", mode="REPEATED")
]
SCHEMA_T2 = [
    bigquery.SchemaField("name", "STRING"),
    bigquery.SchemaField("customValues", "RECORD", mode="REPEATED", fields=[
        bigquery.SchemaField("member", "STRING")
    ])
]
SCHEMA_T3 = [
    bigquery.SchemaField("name", "STRING"),
    bigquery.SchemaField("customValues", "RECORD", mode="REPEATED", fields=[
        bigquery.SchemaField("year", "STRING"),
        bigquery.SchemaField("number", "STRING")
    ])
]
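(SCHEMA_MAP itself just maps table names to these schema lists; the table names below are placeholders for this question:)

SCHEMA_MAP = {
    "table_t1": SCHEMA_T1,  # placeholder table names
    "table_t2": SCHEMA_T2,
    "table_t3": SCHEMA_T3,
}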
Is it possible to define the customValues column to handle all 3 of these different data types in one single table? How would the schema be defined for this? Currently, if SCHEMA_T1 is used and data in the form of T2 or T3 is uploaded, the upload fails with the error "Error while reading data, error message: JSON parsing error in row starting at position 0: JSON object specified for non-record field: customValues". Similar errors occur for the other schemas. Is there a generic "any JSON" field in BigQuery that can be used for this?
Upvotes: 2
Views: 2393
Reputation: 1401
The JSON feature is still in preview for BigQuery (see launch stages). As a workaround, you can use load_table_from_dataframe from the BigQuery client to load columns that might require some refinement before pushing them into your working table.
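(For reference, once the JSON type graduates from preview you could presumably declare the column as JSON directly in the schema; this is untested here and shown only as a sketch:)

# Sketch only: relies on the preview JSON field type being available in your project.
SCHEMA_JSON = [
    bigquery.SchemaField("name", "STRING"),
    bigquery.SchemaField("customValues", "JSON"),
]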
Let's look at your scenario. Let's say we have a data.json file with the raw data:
data.json
[
    {
        "name": "tom",
        "customValues": []
    },
    {
        "name": "joe",
        "customValues": {
            "member": "1"
        }
    },
    {
        "name": "joe",
        "customValues": {
            "year": "2020",
            "number": "3"
        }
    }
]
And we have a single table on BigQuery that we need to populate:
create or replace table `my-project.my-dataset.a-table` (
    name STRING,
    customValues STRING
)
load.py
from google.cloud import bigquery
import pandas as pd

client = bigquery.Client()
table_id = "project-id.dataset-id.a-table"

# Load the raw JSON into a dataframe and stringify the customValues column
# so it fits the STRING column of the target table.
df = pd.read_json('data.json')
df["customValues"] = df["customValues"].apply(str)
print(df.shape)
print(df.head())

# load_table_from_dataframe serializes the dataframe to Parquet before loading.
job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.PARQUET, autodetect=True)
job = client.load_table_from_dataframe(df, table_id, job_config=job_config)
job.result()  # wait for the load job to complete

table = client.get_table(table_id)
print("Loaded {} rows and {} columns to {}".format(table.num_rows, len(table.schema), table_id))
output
| **name** | **customValues** |
|----------|---------------------------------|
| tom | [] |
| joe | {'member': '1'} |
| joe | {'year': '2020', 'number': '3'} |
As you can see, regardless of the structure of customValues we are able to get it inserted into our working table (which only has 2 columns). We load the JSON data into a dataframe and then just convert the column's data type to match the target column type by using apply. For more information about using apply, please visit this link.
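One note on the apply step: str produces a Python-style repr (single quotes), so if you later want to parse customValues with BigQuery's JSON functions, a small variation is to serialize with json.dumps instead:

import json

# json.dumps yields valid JSON (double quotes) rather than Python repr
df["customValues"] = df["customValues"].apply(json.dumps)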
Upvotes: 2