Reputation: 525
I am running into an issue when appending a JSON blob to an existing table in BigQuery; specifically, new but unneeded leaf elements in the JSON being appended are preventing me from appending the data.
Using the UI, I am able to create a table using a JSON file from Google Cloud Storage, and after describing the schema, I can select "Ignore Unknown Values". From the definition:
Enable to accept rows that contain values that do not match the schema. Unknown values will be ignored.
This works extremely well for the initial table creation and import from GCS, and it is a feature I would like to continue to leverage for appends. However, when I try to append data to the table using the bq command-line tool, I get errors because individual leaf elements of the JSON blob do not match the schema.
Has anyone experienced similar issues? Have a quick solve? I know that I could pre-process the JSON to remove the elements I do not need, but I feel like "Ignore Unknown Values" should be a command-line flag.
Unless someone has a specific reason why this is not implemented, I may file a feature request.
Thanks!
Upvotes: 2
Views: 5001
Reputation: 298
Use the command-line option --ignore-unknown-values.
Run bq load --help to see all your options.
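For example, a minimal sketch of an append that drops unknown fields (the dataset, table, and GCS path below are placeholders):
# bq load appends to an existing table by default; names below are placeholders
bq load --source_format=NEWLINE_DELIMITED_JSON \
  --ignore-unknown-values \
  mydataset.mytable \
  gs://mybucket/new_rows.json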
Upvotes: 5
Reputation: 59355
My favorite resilient way of importing JSON data into BigQuery: just import the full JSON string into a one-column BigQuery table - parse it later.
I do this with Wikidata, AcousticBrainz, the Wikipedia change log, etc...
Then you can create a view to parse these objects in real time, or materialize it recurrently (for query cost savings).
My Wikipedia change log view, for example:
SELECT
JSON_EXTRACT_SCALAR(object, '$[0].wiki') wiki,
JSON_EXTRACT_SCALAR(object, '$[0].comment') comment,
JSON_EXTRACT_SCALAR(object, '$[0].server_name') server_name,
JSON_EXTRACT_SCALAR(object, '$[0].server_script_path') server_script_path,
INTEGER(JSON_EXTRACT_SCALAR(object, '$[0].namespace')) namespace,
JSON_EXTRACT_SCALAR(object, '$[0].title') title,
'true'=JSON_EXTRACT_SCALAR(object, '$[0].bot') bot,
JSON_EXTRACT_SCALAR(object, '$[0].server_url') server_url,
INTEGER(JSON_EXTRACT_SCALAR(object, '$[0].length.new')) length_new,
INTEGER(JSON_EXTRACT_SCALAR(object, '$[0].length.old')) length_old,
JSON_EXTRACT_SCALAR(object, '$[0].user') user,
INTEGER(JSON_EXTRACT_SCALAR(object, '$[0].timestamp')) timestamp,
JSON_EXTRACT_SCALAR(object, '$[0].type') type,
INTEGER(JSON_EXTRACT_SCALAR(object, '$[0].id')) id,
'true'=JSON_EXTRACT_SCALAR(object, '$[0].minor') minor,
INTEGER(JSON_EXTRACT_SCALAR(object, '$[0].revision.new')) revision_new,
INTEGER(JSON_EXTRACT_SCALAR(object, '$[0].revision.old')) revision_old,
FROM [wikipediaEdits.changelog_objects]
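If you prefer to materialize the parsed result recurrently instead of parsing at query time, something along these lines should work (the view and destination table names here are placeholders for wherever you saved the view above):
# Rewrites the destination table with a freshly parsed snapshot; names are placeholders
bq query --replace --allow_large_results \
  --destination_table=wikipediaEdits.changelog_parsed \
  "SELECT * FROM [wikipediaEdits.changelog_view]"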
To load a JSON string without parsing it, I do something like:
bq load --replace -F "tab" \
fh-bigquery:test_acousticbrainz.lowlevel \
gs://fh-datalab-musicbrainz/acousticbrainz/acousticbrainz-lowlevel* item
(you just need to import the JSON file as if it were a .csv and choose a separator other than comma - tab usually works for me inside JSON; the trailing item is the one-column schema that receives each raw line)
Tl;dr: 1. Don't parse when storing. 2. Store the full JSON string. 3. No more data loss!
Upvotes: 4