de1
de1

Reputation: 3124

Retrieving BigQuery validation errors when loading JSONL data via the Python API

How can I retrieve more information relating to the validation errors when loading a JSONL file into BigQuery? (The question is not about solving the issue)

Example code:

from google.cloud.bigquery import (
    LoadJobConfig,
    QueryJobConfig,
    Client,
    SourceFormat,
    WriteDisposition
)

# variables depending on the environment
filename = '...'
gcp_project_id = '...'
dataset_name = '...'
table_name = '...'
schema = [ ... ]

# loading data
client = Client(project=project_id)
dataset_ref = client.dataset(dataset_name)
table_ref = dataset_ref.table(table_name)
job_config = LoadJobConfig()
job_config.source_format = SourceFormat.NEWLINE_DELIMITED_JSON
job_config.write_disposition = WriteDisposition.WRITE_APPEND
job_config.schema = schema
LOGGER.info('loading from %s', filename)
with open(filename, "rb") as source_file:
    job = client.load_table_from_file(
        source_file, destination=table_ref, job_config=job_config
    )

    # Waits for table cloud_data_store to complete
    job.result()

Here I am using bigquery-schema-generator to generate a schema (as BigQuery otherwise only looks at the first 100 rows).

Running that might error with the following error message (google.api_core.exceptions.BadRequest):

400 Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 1; errors: 1. Please look into the errors[] collection for more details.

Looking at the errors property basically doesn't provide any new information:

[{'reason': 'invalid',
  'message': 'Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 1; errors: 1. Please look into the errors[] collection for more details.'}]

I also looked at __dict__ of the exception but that hasn't revealed any further information.

Trying to load the table using the bq command line (in this case without explicit schema) results in a much more helpful message:

BigQuery error in load operation: Error processing job '...': Provided Schema does not match Table <table name>. Field <field name> has changed type from TIMESTAMP to DATE

My question now is how would I be able to retrieve such helpful message from the Python API?

Solution based on accepted answer

Here is a copy and past workaround that one could add in order to show more information by default. (There may be downsides to it)

import google.cloud.exceptions
import google.cloud.bigquery.job


def get_improved_bad_request_exception(
    job: google.cloud.bigquery.job.LoadJob
) -> google.cloud.exceptions.BadRequest:
    errors = job.errors
    result = google.cloud.exceptions.BadRequest(
        '; '.join([error['message'] for error in errors]),
        errors=errors
    )
    result._job = job
    return result


def wait_for_load_job(
    job: google.cloud.bigquery.job.LoadJob
):
    try:
        job.result()
    except google.cloud.exceptions.BadRequest as exc:
        raise get_improved_bad_request_exception(job) from exc

Then calling wait_for_load_job(job) instead of job.result() directly, will result in a more useful exception (the error message and errors property).

Upvotes: 3

Views: 1994

Answers (1)

Ricco D
Ricco D

Reputation: 7287

To be able to show a more helpful error message, you can import google.api_core.exceptions.BadRequest to catch exceptions and then use LoadJob attribute errors to get verbose error messages from the job.

from google.api_core.exceptions import BadRequest
...
...
try:
    load_job.result()# Waits for the job to complete.
except BadRequest:
    for error in load_job.errors:
        print(error["message"])  # error is of type dictionary

For testing I used the sample code BQ load json data and changed the input file to produce an error. In the file I changed the value for "post_abbr" from string to an array value.

File used:

{"name": "Alabama", "post_abbr": "AL"}
{"name": "Alaska", "post_abbr":  "AK"}
{"name": "Arizona", "post_abbr": [65,2]}

See output below when code snippet above is applied. The last error message shows the actual error about "post_abbr" receiving an Array for a non repeated field.

Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 3; errors: 1. Please look into the errors[] collection for more details.
Error while reading data, error message: JSON processing encountered too many errors, giving up. Rows: 3; errors: 1; max bad: 0; error percent: 0
Error while reading data, error message: JSON parsing error in row starting at position 78: Array specified for non-repeated field: post_abbr.

Upvotes: 4

Related Questions