andre622

Reputation: 525

Ignore unknown values on append

I am running into an issue when appending a JSON blob to an existing table in BigQuery; specifically, new but unneeded leaf elements in the JSON being appended are preventing the append from succeeding.

Using the UI, I am able to create a table using a JSON file from Google Cloud Storage, and after describing the schema, I can select to "Ignore Unknown Values". From the definition:

Enable to accept rows that contain values that do not match the schema. Unknown values will be ignored.

This works extremely well for the initial table creation and import from GCS, and it is a feature I would like to keep leveraging for appends. However, when I try to append data to the table using the bq command-line tool, I get errors because individual leaf elements of the JSON blob do not match the schema.

Has anyone experienced similar issues? Is there a quick fix? I know I could pre-process the JSON to remove the elements I do not need, but I feel like "Ignore Unknown Values" should be available as a command-line flag.

Unless someone has a specific reason why this is not implemented, I may file a feature request.

Thanks!

Upvotes: 2

Views: 5001

Answers (2)

Andy Blackwell

Reputation: 298

Use the command-line option --ignore-unknown-values

Use bq load --help to see all your options.
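
For example, a hypothetical append of newline-delimited JSON from GCS (the dataset, table, and file names here are placeholders):

# Append rows, skipping any fields that are not in the table's schema
bq load --source_format=NEWLINE_DELIMITED_JSON \
   --ignore-unknown-values \
   mydataset.mytable gs://mybucket/new_rows.json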

Upvotes: 5

Felipe Hoffa

Reputation: 59355

My favorite resilient way of importing JSON data into BigQuery: just load the full JSON string into a one-column table and parse it later.

I do this with Wikidata, AcousticBrainz, the Wikipedia change log, etc...

Then you can create a view to parse these objects in real time, or materialize the results recurrently to save on query costs (see the sketch after the view below).

My Wikipedia change log view, for example:

SELECT 
  JSON_EXTRACT_SCALAR(object, '$[0].wiki') wiki,
  JSON_EXTRACT_SCALAR(object, '$[0].comment') comment,
  JSON_EXTRACT_SCALAR(object, '$[0].server_name') server_name,
  JSON_EXTRACT_SCALAR(object, '$[0].server_script_path') server_script_path,
  INTEGER(JSON_EXTRACT_SCALAR(object, '$[0].namespace')) namespace,
  JSON_EXTRACT_SCALAR(object, '$[0].title') title,
  'true'=JSON_EXTRACT_SCALAR(object, '$[0].bot') bot,
  JSON_EXTRACT_SCALAR(object, '$[0].server_url') server_url,
  INTEGER(JSON_EXTRACT_SCALAR(object, '$[0].length.new')) length_new,
  INTEGER(JSON_EXTRACT_SCALAR(object, '$[0].length.old')) length_old,
  JSON_EXTRACT_SCALAR(object, '$[0].user') user,
  INTEGER(JSON_EXTRACT_SCALAR(object, '$[0].timestamp')) timestamp,
  JSON_EXTRACT_SCALAR(object, '$[0].type') type,
  INTEGER(JSON_EXTRACT_SCALAR(object, '$[0].id')) id,
  'true'=JSON_EXTRACT_SCALAR(object, '$[0].minor') minor,
  INTEGER(JSON_EXTRACT_SCALAR(object, '$[0].revision.new')) revision_new,
  INTEGER(JSON_EXTRACT_SCALAR(object, '$[0].revision.old')) revision_old
FROM [wikipediaEdits.changelog_objects]
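
To materialize it recurrently, as mentioned above, a query along these lines could be scheduled (the saved view and destination table names are placeholders):

# Rebuild the parsed table from the view (Legacy SQL; --replace overwrites it)
bq query --replace \
   --destination_table=wikipediaEdits.changelog_parsed \
   "SELECT * FROM [wikipediaEdits.changelog_view]"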

To load a JSON string without parsing it, I do something like:

bq load --replace -F "tab" \
   fh-bigquery:test_acousticbrainz.lowlevel \
   gs://fh-datalab-musicbrainz/acousticbrainz/acousticbrainz-lowlevel* item

(you just need to import the JSON file as if it were a CSV, choosing a separator that does not appear in the data; tab usually works for me, since raw tabs rarely show up inside a JSON string)
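
And since the original question was about appending: bq load appends by default, so the same command without --replace adds new files to the one-column table (the file pattern here is a placeholder):

# Default write disposition is append, so no --replace flag
bq load -F "tab" \
   fh-bigquery:test_acousticbrainz.lowlevel \
   gs://fh-datalab-musicbrainz/acousticbrainz/new-batch* item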

Tl;dr: 1. Don't parse when storing. 2. Store the full JSON string. 3. No more data loss!

Upvotes: 4
