Reputation: 3124
How can I retrieve more information relating to the validation errors when loading a JSONL file into BigQuery? (The question is not about solving the issue)
Example code:
from google.cloud.bigquery import (
LoadJobConfig,
QueryJobConfig,
Client,
SourceFormat,
WriteDisposition
)
# variables depending on the environment
filename = '...'
gcp_project_id = '...'
dataset_name = '...'
table_name = '...'
schema = [ ... ]
# loading data
client = Client(project=project_id)
dataset_ref = client.dataset(dataset_name)
table_ref = dataset_ref.table(table_name)
job_config = LoadJobConfig()
job_config.source_format = SourceFormat.NEWLINE_DELIMITED_JSON
job_config.write_disposition = WriteDisposition.WRITE_APPEND
job_config.schema = schema
LOGGER.info('loading from %s', filename)
with open(filename, "rb") as source_file:
job = client.load_table_from_file(
source_file, destination=table_ref, job_config=job_config
)
# Waits for table cloud_data_store to complete
job.result()
Here I am using bigquery-schema-generator to generate a schema (as BigQuery otherwise only looks at the first 100 rows).
Running that might error with the following error message (google.api_core.exceptions.BadRequest
):
400 Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 1; errors: 1. Please look into the errors[] collection for more details.
Looking at the errors
property basically doesn't provide any new information:
[{'reason': 'invalid',
'message': 'Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 1; errors: 1. Please look into the errors[] collection for more details.'}]
I also looked at __dict__
of the exception but that hasn't revealed any further information.
Trying to load the table using the bq
command line (in this case without explicit schema) results in a much more helpful message:
BigQuery error in load operation: Error processing job '...': Provided Schema does not match Table <table name>. Field <field name> has changed type from TIMESTAMP to DATE
My question now is how would I be able to retrieve such helpful message from the Python API?
Here is a copy and past workaround that one could add in order to show more information by default. (There may be downsides to it)
import google.cloud.exceptions
import google.cloud.bigquery.job
def get_improved_bad_request_exception(
job: google.cloud.bigquery.job.LoadJob
) -> google.cloud.exceptions.BadRequest:
errors = job.errors
result = google.cloud.exceptions.BadRequest(
'; '.join([error['message'] for error in errors]),
errors=errors
)
result._job = job
return result
def wait_for_load_job(
job: google.cloud.bigquery.job.LoadJob
):
try:
job.result()
except google.cloud.exceptions.BadRequest as exc:
raise get_improved_bad_request_exception(job) from exc
Then calling wait_for_load_job(job)
instead of job.result()
directly, will result in a more useful exception (the error message and errors
property).
Upvotes: 3
Views: 1994
Reputation: 7287
To be able to show a more helpful error message, you can import google.api_core.exceptions.BadRequest
to catch exceptions and then use LoadJob attribute errors to get verbose error messages from the job.
from google.api_core.exceptions import BadRequest
...
...
try:
load_job.result()# Waits for the job to complete.
except BadRequest:
for error in load_job.errors:
print(error["message"]) # error is of type dictionary
For testing I used the sample code BQ load json data and changed the input file to produce an error. In the file I changed the value for "post_abbr"
from string to an array value.
File used:
{"name": "Alabama", "post_abbr": "AL"}
{"name": "Alaska", "post_abbr": "AK"}
{"name": "Arizona", "post_abbr": [65,2]}
See output below when code snippet above is applied. The last error message shows the actual error about "post_abbr"
receiving an Array for a non repeated field.
Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 3; errors: 1. Please look into the errors[] collection for more details.
Error while reading data, error message: JSON processing encountered too many errors, giving up. Rows: 3; errors: 1; max bad: 0; error percent: 0
Error while reading data, error message: JSON parsing error in row starting at position 78: Array specified for non-repeated field: post_abbr.
Upvotes: 4