Reputation: 3
We have a Dataflow pipeline consuming from a Pub/Sub subscription and writing into BigQuery in streaming. Due to a permissions issue the pipeline got stuck and the messages were not consumed, so we restarted the pipeline, saved the unacked messages in a snapshot, and replayed the messages, but they are discarded.
We fixed the problem and re-deployed the pipeline with a new subscription to the topic, and all new events are consumed in streaming without a problem.
For the ~20M unacked messages accumulated in the first subscription, we created a snapshot.
This snapshot was then connected to the new subscription via the UI, using the Replay messages dialog.
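For reference, this is roughly the programmatic equivalent of those UI steps, using the google-cloud-pubsub client library (the project and resource names below are placeholders, not our real ones):

```python
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
old_sub = subscriber.subscription_path("my-project", "old-subscription")
new_sub = subscriber.subscription_path("my-project", "new-subscription")
snapshot = subscriber.snapshot_path("my-project", "unacked-snapshot")

# Capture the unacked backlog of the stuck subscription.
subscriber.create_snapshot(request={"name": snapshot, "subscription": old_sub})

# Replay: seek the new subscription (same topic) back to the snapshot.
subscriber.seek(request={"subscription": new_sub, "snapshot": snapshot})
```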
In the subscription metrics dashboard we see the unacked messages spike to 20M and then get consumed.
But the events never arrive in BigQuery. Checking the Dataflow job metrics, we can see a spike in the Duplicate message count within the "Read from Pub/Sub" step.
The messages are < 3 days old. Does anybody know why this happens? Thanks in advance.
The pipeline is using Apache Beam SDK 2.39.0 and Python 3.9, with Streaming Engine and Runner v2 enabled.
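Concretely, the options look roughly like this (project and region are placeholders; the flag names are the standard Dataflow ones):

```python
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    streaming=True,
    project="my-project",            # placeholder
    region="us-central1",            # placeholder
    enable_streaming_engine=True,    # Streaming Engine
    experiments=["use_runner_v2"],   # Dataflow Runner v2
)
```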
Upvotes: 0
Views: 286
Reputation: 1428
How long does it take to process a Pub/Sub message? Is it a long-running process?
If so, Pub/Sub may redeliver messages, depending on the subscription's configuration and acknowledgement deadline. See Subscription retry policy.
Dataflow can work around that, since it acknowledges messages from the source only after a successful shuffle. Adding a GroupByKey (or, artificially, a Reshuffle) transform may therefore resolve source duplications; see the sketch below the link.
More information at https://beam.apache.org/contribute/ptransform-style-guide/#performance
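A minimal sketch of where such a Reshuffle could sit in a Python pipeline like yours (the subscription path and table name are placeholders, and the table is assumed to already exist):

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

opts = PipelineOptions(streaming=True)  # plus the Dataflow options from the question

with beam.Pipeline(options=opts) as p:
    (p
     | "Read" >> beam.io.ReadFromPubSub(
           subscription="projects/my-project/subscriptions/new-subscription")
     | "Parse" >> beam.Map(json.loads)
     # Dataflow acknowledges the source only after a successful shuffle,
     # so this forces that checkpoint before the BigQuery write.
     | "Reshuffle" >> beam.Reshuffle()
     | "Write" >> beam.io.WriteToBigQuery(
           "my-project:dataset.table",
           method=beam.io.WriteToBigQuery.Method.STREAMING_INSERTS))
```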
Upvotes: 0