Andrew Cassidy

Reputation: 2998

How to debug python dataflow beam.io.WriteToBigQuery

I currently have a Python Dataflow job whose final sink is a PCollection written to BigQuery. It's failing with the following error:

Workflow failed. Causes: S01:XXXX+XXX+Write/WriteToBigQuery/NativeWrite failed., BigQuery import job "dataflow_job_XXXXXX" failed., BigQuery job "dataflow_job_XXXXXX" in project "XXXXXX" finished with error(s): errorResult: Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 19; errors: 1

To get a more detailed error report I'm then running:

bq --format=prettyjson show -j dataflow_job_XXXXXX

which displays something like this (there are a bunch of errors; this is just one of them):

    {
      "location": "gs://XXXXX/XXXXXX/tmp/XXXXX/10002237702794672370/dax-tmp-2019-02-05_20_14_50-18341731408970037725-S01-0-5144bb700f6a9f0b/-shard--try-00d3c2c24d5b0371-endshard.json",
      "message": "Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 11; errors: 1. Please look into the errors[] collection for more details.",
      "reason": "invalid"
    },
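For what it's worth, the same errors[] collection can also be pulled programmatically instead of eyeballing the prettyjson output. A minimal sketch, assuming the google-cloud-bigquery client library and leaving the project and job IDs as the redacted placeholders above:

    from google.cloud import bigquery

    # Print every entry in the failed load job's errors[] collection.
    client = bigquery.Client(project="XXXXXX")
    job = client.get_job("dataflow_job_XXXXXX")
    for err in job.errors or []:
        print(err.get("reason"), err.get("location"), err.get("message"))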

I then go looking for the particular shard to see which PCollection rows are in error, so I can either filter them out or fix a bug on my side:

gsutil ls gs://XXXXX/XXXXXX/tmp/XXXXX/10002237702794672370/dax-tmp-2019-02-05_20_14_50-18341731408970037725-S01-0-5144bb700f6a9f0b/-shard--try-00d3c2c24d5b0371-endshard.json

But that command returns:

CommandException: One or more URLs matched no objects.

What are the best practices for debugging these jobs (which take multiple hours, by the way)? My thought right now is to write the PCollection to GCS in a non-temp location in JSON format and try to ingest it myself (roughly as sketched below).
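A minimal sketch of that idea, assuming rows is the existing PCollection of dicts that feeds WriteToBigQuery and that the bucket path is a placeholder of my own:

    import json
    import apache_beam as beam

    # Tee the same rows that feed WriteToBigQuery out to a non-temp GCS location
    # as newline-delimited JSON, so the files can be inspected or loaded by hand
    # with `bq load` to reproduce the error.
    (rows
     | "ToJsonString" >> beam.Map(json.dumps)
     | "DumpToGcs" >> beam.io.WriteToText(
           "gs://my-debug-bucket/debug/rows",  # hypothetical debug location
           file_name_suffix=".json"))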

Upvotes: 1

Views: 1418

Answers (1)

John Hanley

Reputation: 81336

For your type of error, I do the following:

  1. Use a JSON validation tool to list the records with errors.
  2. Run the Cloud Dataflow pipeline locally (with the DirectRunner) on a small sample.
  3. Add a pipeline step that validates each JSON record and removes the bad entries from the pipeline. Write the bad records to a dead-letter file via a side output, or log them for debugging (see the sketch after this list).
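A minimal sketch of steps 1 and 3 combined, assuming the pipeline has a PCollection of newline-delimited JSON strings called lines, and that the table and bucket names are placeholders:

    import json
    import apache_beam as beam
    from apache_beam.pvalue import TaggedOutput

    class ValidateJson(beam.DoFn):
        """Pass parseable records through; route bad ones to a dead-letter output."""
        def process(self, line):
            try:
                yield json.loads(line)
            except ValueError as e:
                # Side output collects the bad rows instead of failing the BigQuery load.
                yield TaggedOutput("dead_letter", {"raw": line, "error": str(e)})

    results = lines | "Validate" >> beam.ParDo(ValidateJson()).with_outputs(
        "dead_letter", main="valid")

    # Good rows go to the existing sink (table assumed to already exist).
    results.valid | "WriteToBQ" >> beam.io.WriteToBigQuery("my-project:my_dataset.my_table")

    # Bad rows land in a dead-letter file for later inspection.
    (results.dead_letter
     | "DeadLetterToJson" >> beam.Map(json.dumps)
     | "DeadLetterToGcs" >> beam.io.WriteToText("gs://my-debug-bucket/dead_letter"))

Running the same pipeline with --runner=DirectRunner on a sample of the input is usually enough to surface the bad records quickly, without waiting hours for a full Dataflow run.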

This article might give you some ideas for handling invalid inputs: Handling Invalid Inputs in Dataflow.

Upvotes: 2
