Jesús Rojas

Reputation: 3

Pub/Sub messages from snapshot not processed in a Dataflow streaming pipeline

We have a Dataflow pipeline consuming from Pub/Sub and writing into BigQuery in streaming mode. Due to a permissions issue the pipeline got stuck and the messages were not consumed. We restarted the pipeline, saved the unacked messages in a snapshot, and replayed the messages, but they are discarded.

The messages are < 3 days old. Does anybody know why this happens? Thanks in advance.

The pipeline is using Apache Beam SDK 2.39.0 and Python 3.9, with Streaming Engine and the v2 runner enabled.
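For reference, the snapshot-and-replay step was done roughly along these lines (the subscription and snapshot names below are placeholders, not the real ones):

```shell
# Hypothetical names: my-sub / stuck-backlog.
# Capture the subscription's unacked backlog in a snapshot,
# then seek the subscription back to it so the messages are redelivered.
gcloud pubsub snapshots create stuck-backlog --subscription=my-sub
gcloud pubsub subscriptions seek my-sub --snapshot=stuck-backlog
```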

Upvotes: 0

Views: 286

Answers (1)

Bruno Volpato

Reputation: 1428

How long does it take to process a Pub/Sub message? Is it a long-running process?

If so, Pub/Sub may redeliver messages, depending on the subscription's configuration and acknowledgement deadline. See Subscription retry policy.

Dataflow can work around that: it acknowledges messages at the source after the first successful shuffle. If you add a GroupByKey (or, artificially, a Reshuffle) transform early in the pipeline, it may resolve the source duplications.

More information at https://beam.apache.org/contribute/ptransform-style-guide/#performance

Upvotes: 0
