Reputation: 10213
We have the following three JSONs with data that should be loaded into the same table:
{ "name": "tom", "customValues": [] }
{ "name": "joe", "customValues": { "member": "1" } }
{ "name": "joe", "customValues": { "year": "2020", "number": "3" } }
We load the data with the Python client, using bigquery.LoadJobConfig:
job_config = bigquery.LoadJobConfig(
    schema=SCHEMA_MAP.get(bq_table) if autodetect == False else None,
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE if remove_old == True else bigquery.WriteDisposition.WRITE_APPEND,
    autodetect=autodetect
)
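For reference, this config is then passed to a load call roughly like this (the client setup, file name, and table id are simplified placeholders for this question):

client = bigquery.Client()
with open("data.ndjson", "rb") as source_file:
    load_job = client.load_table_from_file(
        source_file,
        "my-project.my-dataset.my-table",  # placeholder table id
        job_config=job_config,
    )
load_job.result()  # wait for the load to complete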
SCHEMA_MAP is a dictionary of arrays, where each array is the schema for one of our tables. We define our BigQuery schemas in Python using bigquery.SchemaField. If each of the 3 JSONs above were going into 3 different tables, I would have their table schemas defined as:
SCHEMA_T1 = [
    bigquery.SchemaField("name", "STRING"),
    bigquery.SchemaField("customValues", "STRING", mode="REPEATED")
]
SCHEMA_T2 = [
    bigquery.SchemaField("name", "STRING"),
    bigquery.SchemaField("customValues", "RECORD", mode="REPEATED", fields=[
        bigquery.SchemaField("member", "STRING")
    ])
]
SCHEMA_T3 = [
    bigquery.SchemaField("name", "STRING"),
    bigquery.SchemaField("customValues", "RECORD", mode="REPEATED", fields=[
        bigquery.SchemaField("year", "STRING"),
        bigquery.SchemaField("number", "STRING")
    ])
]
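(SCHEMA_MAP itself just maps table names to these schema lists; the table names below are placeholders for this question:)

SCHEMA_MAP = {
    "table_t1": SCHEMA_T1,  # placeholder table names
    "table_t2": SCHEMA_T2,
    "table_t3": SCHEMA_T3,
}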
Is it possible to define the customValues column to handle all 3 of these different data types in one single table? How would the schema be defined for this? Currently, if SCHEMA_T1 is used and data in the form of T2 or T3 is uploaded, the upload fails with the error "Error while reading data, error message: JSON parsing error in row starting at position 0: JSON object specified for non-record field: customValues". Similar errors occur for the other schemas. Is there a generic "any JSON" field in BigQuery that can be used for this?
Upvotes: 2
Views: 2393
Reputation: 1401
The JSON feature is still in preview for BigQuery (see launch stages). As a workaround, you can use load_table_from_dataframe from the BigQuery client to load columns that might require some refinement before pushing them into your working table.
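(For reference, once the JSON type graduates from preview you could presumably declare the column as JSON directly in the schema; this is untested here and shown only as a sketch:)

# Sketch only: relies on the preview JSON field type being available in your project.
SCHEMA_JSON = [
    bigquery.SchemaField("name", "STRING"),
    bigquery.SchemaField("customValues", "JSON"),
]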
Let's look at your scenario. Let's say we have a data.json file with the raw data:
data.json
[
    {
        "name": "tom",
        "customValues": []
    },
    {
        "name": "joe",
        "customValues": {
            "member": "1"
        }
    },
    {
        "name": "joe",
        "customValues": {
            "year": "2020",
            "number": "3"
        }
    }
]
And we have a single table on BigQuery that we need to populate:
create or replace table `my-project.my-dataset.a-table` (
    name STRING,
    customValues STRING
)
load.py
from google.cloud import bigquery
import pandas as pd

client = bigquery.Client()
table_id = "project-id.dataset-id.a-table"

# Load the raw JSON into a dataframe and stringify the customValues column
# so it fits the STRING column of the target table.
df = pd.read_json('data.json')
df["customValues"] = df["customValues"].apply(str)
print(df.shape)
print(df.head())

# load_table_from_dataframe serializes the dataframe to Parquet before loading.
job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.PARQUET, autodetect=True)
job = client.load_table_from_dataframe(df, table_id, job_config=job_config)
job.result()  # wait for the load job to complete

table = client.get_table(table_id)
print("Loaded {} rows and {} columns to {}".format(table.num_rows, len(table.schema), table_id))
output
| **name** | **customValues** |
|----------|---------------------------------|
| tom | [] |
| joe | {'member': '1'} |
| joe | {'year': '2020', 'number': '3'} |
As you can see, regardless of the structure of customValues we are able to get it inserted into our working table (which only has 2 columns). We load the JSON data into a dataframe and then just convert the column's data type to match the target column type by using apply. For more information about using apply, please visit this link.
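One note on the apply step: str produces a Python-style repr (single quotes), so if you later want to parse customValues with BigQuery's JSON functions, a small variation is to serialize with json.dumps instead:

import json

# json.dumps yields valid JSON (double quotes) rather than Python repr
df["customValues"] = df["customValues"].apply(json.dumps)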
Upvotes: 2