Reputation: 2998
I currently have a Python Dataflow job whose final sink is a write of a PCollection to BigQuery. It's failing with the following error:
Workflow failed. Causes: S01:XXXX+XXX+Write/WriteToBigQuery/NativeWrite failed., BigQuery import job "dataflow_job_XXXXXX" failed., BigQuery job "dataflow_job_XXXXXX" in project "XXXXXX" finished with error(s): errorResult: Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 19; errors: 1
To get a more detailed error report, I then run:
bq --format=prettyjson show -j dataflow_job_XXXXXX
which displays something like this (there are a bunch of errors; this is just one of them):
{
"location": "gs://XXXXX/XXXXXX/tmp/XXXXX/10002237702794672370/dax-tmp-2019-02-05_20_14_50-18341731408970037725-S01-0-5144bb700f6a9f0b/-shard--try-00d3c2c24d5b0371-endshard.json",
"message": "Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 11; errors: 1. Please look into the errors[] collection for more details.",
"reason": "invalid"
},
I then go looking for the particular shard to see which PCollection rows are in error, so I can either filter those rows out or fix a bug in my code:
gsutil ls gs://XXXXX/XXXXXX/tmp/XXXXX/10002237702794672370/dax-tmp-2019-02-05_20_14_50-18341731408970037725-S01-0-5144bb700f6a9f0b/-shard--try-00d3c2c24d5b0371-endshard.json
But that command returns:
CommandException: One or more URLs matched no objects.
What are the best practices for debugging these jobs (which take multiple hours, by the way)? My current thought is to write the PCollection to GCS in a non-temp location in JSON format and try to ingest it into BigQuery myself.
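If it helps to see what I mean, here is a minimal sketch of that debug write, assuming `rows` is the PCollection of dicts that currently feeds `WriteToBigQuery` (the bucket path and step names are placeholders):

import json

import apache_beam as beam

# Somewhere in the existing pipeline, where `rows` is the PCollection of
# dicts that currently feeds WriteToBigQuery (placeholder name):
(
    rows
    | "SerializeForDebug" >> beam.Map(json.dumps)
    | "WriteDebugCopy" >> beam.io.WriteToText(
        "gs://my-debug-bucket/debug/rows",  # a non-temp location I control
        file_name_suffix=".json",
    )
)

I could then run bq load --source_format=NEWLINE_DELIMITED_JSON against those files with the same schema to reproduce the parse errors without re-running the whole pipeline.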
Upvotes: 1
Views: 1418
Reputation: 81336
For your type of error, I do the following. This article might give you some ideas for handling invalid inputs:
Handling Invalid Inputs in Dataflow
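The core idea in that article is a dead-letter pattern: wrap the step that produces your BigQuery rows in a try/except, send the elements that fail to a tagged side output, and write them somewhere you can inspect, instead of letting the load job fail. A minimal sketch in the Beam Python SDK (here `lines`, the parsing logic, the table name, and the GCS path are all placeholders for your own pipeline):

import json

import apache_beam as beam
from apache_beam import pvalue


class ToTableRow(beam.DoFn):
    """Converts an element to a BigQuery row dict; bad elements go to a dead-letter output."""

    DEAD_LETTER_TAG = "dead_letter"

    def process(self, element):
        try:
            # Replace with whatever conversion currently produces your rows.
            yield json.loads(element)
        except Exception as err:
            # Route the failure to a side output instead of failing the load job.
            yield pvalue.TaggedOutput(
                self.DEAD_LETTER_TAG, {"element": element, "error": str(err)}
            )


results = lines | "ToTableRows" >> beam.ParDo(ToTableRow()).with_outputs(
    ToTableRow.DEAD_LETTER_TAG, main="rows"
)

results.rows | "WriteToBQ" >> beam.io.WriteToBigQuery("my-project:my_dataset.my_table")

(
    results.dead_letter
    | "SerializeBadRows" >> beam.Map(json.dumps)
    | "WriteBadRows" >> beam.io.WriteToText(
        "gs://my-debug-bucket/dead_letter/bad_rows", file_name_suffix=".json"
    )
)

With something like that in place, the main write only sees rows that converted cleanly, and the dead-letter files tell you exactly which inputs to fix or filter.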
Upvotes: 2