dendog

Reputation: 3328

BigQuery streaming from dataflow failing silently

I have had a pipeline successfully streaming data from Pub/Sub into BigQuery using Cloud Dataflow, running on a Compute Engine instance rather than an actual Dataflow runner.

Today I updated the BQ table schema, and now no new inserts seem to occur. I can view logs on the machine and all looks fine; Dataflow is not reporting any errors.

Is there any way to access streaming logs from BigQuery to check for errors?

EDIT: To summarise, my question is whether I can get more verbose logging, either from the Apache Beam SDK or from BigQuery, to see where this data is ending up.

I have had a look in Stackdriver, but there do not seem to be any entries for the streaming inserts.

Upvotes: 2

Views: 1467

Answers (2)

Pablo

Reputation: 11021

As of versions 2.15 and 2.16, Beam produces a dead-letter PCollection containing all of the rows that failed to be inserted.

This behaviour is configurable via the insert_retry_policy parameter. The default in 2.15 and 2.16 is RETRY_ON_TRANSIENT_ERRORS; starting with 2.17, the default will be RETRY_ALWAYS.

You would do the following:

result = my_collection | WriteToBigQuery(...,
                                         method='STREAMING_INSERTS', ...)

failed_rows = result['FailedRows']  # You can consume this PCollection.
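
For example, here is a minimal sketch of consuming that PCollection by logging each failed row (the step label and helper function are my own illustration, not part of the API):

import logging

import apache_beam as beam

def log_failed_row(row):
    # Each element of FailedRows describes a row that could not be inserted.
    logging.error('BigQuery streaming insert failed: %s', row)
    return row

# failed_rows comes from the snippet above.
failed_rows | 'LogFailedRows' >> beam.Map(log_failed_row)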

You may also choose to always retry:

result = my_collection | WriteToBigQuery(...,
                                         insert_retry_policy='RETRY_ALWAYS',
                                         method='STREAMING_INSERTS', ...)

This will cause nothing to be output to failed_rows, and your pipeline may run forever retrying the failed inserts.

Upvotes: 2

IDMT

Reputation: 176

You should be able to get your streaming logs from BigQuery; please take a look at these docs [1][2]. Be aware that modifying a table's schema can take several minutes to propagate, and a table that has recently received streaming inserts may respond with schema mismatch errors.

In this case, when BigQuery encounters a schema mismatch on individual rows in the request, none of the rows are inserted, and an insertErrors entry is returned for each row, including detailed information about the schema mismatch.
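As a hedged sketch (the table name and rows below are hypothetical), you can observe these insertErrors directly by calling the streaming API with the google-cloud-bigquery client library:

from google.cloud import bigquery

client = bigquery.Client()
table_id = 'my-project.my_dataset.my_table'  # hypothetical table

rows = [
    {'name': 'alice', 'age': 30},
    {'name': 'bob', 'no_such_field': 'x'},  # does not match the schema
]

# insert_rows_json wraps the tabledata.insertAll streaming API and returns
# one entry per failing row, each with its index and detailed error messages.
errors = client.insert_rows_json(table_id, rows)
for entry in errors:
    print(entry['index'], entry['errors'])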

[1] https://cloud.google.com/bigquery/troubleshooting-errors#streaming
[2] https://cloud.google.com/bigquery/docs/reference/auditlogs/#mapping_audit_entries_to_log_streams

Upvotes: 0
