Reputation: 976
My streaming Dataflow pipeline, which reads from Pub/Sub, never writes to BigQuery and logs no errors. The elements enter the node "Write to BigQuery/StreamingInserts/StreamingWriteTables/Reshuffle/GroupByKey", which is created implicitly by this code:
PCollection<TableRow> rows = ...;
rows.apply("Write to BigQuery",
    BigQueryIO.writeTableRows()
        .to(poptions.getOutputTableName())
        .withSchema(...)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
        .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)
        .withFailedInsertRetryPolicy(InsertRetryPolicy.alwaysRetry())
        .withExtendedErrorInfo());
But elements never leave it, or at least not within the system lag, which is now 45 minutes. This is supposed to be a streaming job; how can I make it flush and write the data? This is Beam version 2.13.0. Thank you.
UPDATE: the Stackdriver log for the step that writes data to BigQuery shows no errors.
I can also add that this works if I use the DirectRunner in the cloud (but only for small amounts of data), and with either runner if I insert row by row using the Java client for BigQuery (but that is at least two orders of magnitude too slow to begin with).
Upvotes: 2
Views: 841
Reputation: 478
You might try changing your retry policy to InsertRetryPolicy.retryTransientErrors(). The alwaysRetry() policy makes the pipeline appear to stop making progress whenever there is a configuration error, for example the BigQuery table not existing or the job not having permission to access it: the failed inserts are retried forever, so they are never reported as failures.
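A minimal sketch of that change, assuming a String table spec and a TableSchema are in scope (the helper and its parameter names are illustrative, not from your pipeline): it switches to retryTransientErrors() and logs the failed rows that withExtendedErrorInfo() exposes via getFailedInsertsWithErr(), so a permanent error such as a missing table shows up in the logs instead of being retried silently.

import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryInsertError;
import org.apache.beam.sdk.io.gcp.bigquery.InsertRetryPolicy;
import org.apache.beam.sdk.io.gcp.bigquery.WriteResult;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class BigQueryWriteWithErrorLogging {
  private static final Logger LOG =
      LoggerFactory.getLogger(BigQueryWriteWithErrorLogging.class);

  // Hypothetical helper: 'table' and 'schema' stand in for your
  // poptions.getOutputTableName() and withSchema(...) arguments.
  static void write(PCollection<TableRow> rows, String table, TableSchema schema) {
    WriteResult result = rows.apply("Write to BigQuery",
        BigQueryIO.writeTableRows()
            .to(table)
            .withSchema(schema)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
            .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)
            // Retry only errors that can succeed on retry; permanent errors
            // (missing table, schema mismatch, missing permissions) fail fast.
            .withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors())
            .withExtendedErrorInfo());

    // With withExtendedErrorInfo(), failed rows arrive here together with the
    // BigQuery error details, so the misconfiguration is visible in Stackdriver.
    result.getFailedInsertsWithErr()
        .apply("Log failed inserts", ParDo.of(new DoFn<BigQueryInsertError, Void>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            LOG.error("BigQuery insert failed: {} for row {}",
                c.element().getError(), c.element().getRow());
          }
        }));
  }
}

Once the permanent failures stop being retried, the Reshuffle/GroupByKey step should drain normally, and the logged error tells you what to fix (table, schema, or permissions).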
You can also check the worker logs in Stackdriver Logging. Do this by clicking on the "Stackdriver" link in the upper corner of the step log pane. Full directions are in the Dataflow logging documentation.
Upvotes: 1