Draining a Dataflow job and starting another one right after causes message duplication

Question

I have a Dataflow job that reads messages from a Pub/Sub subscription:

p.apply("pubsub-topic-read", PubsubIO.readMessagesWithAttributes()
    .fromSubscription(options.getPubSubSubscriptionName())
    .withIdAttribute("uuid"))

I see in the docs that Pub/Sub does not guarantee the absence of duplicates, and that Beam suggests using withIdAttribute to deal with them.
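
To make clear what I mean by deduplicating on an ID attribute, here is a rough plain-Java sketch. It is only an illustration of the idea, not how Beam or Dataflow actually implements withIdAttribute; the class IdDeduplicator and its accept method are hypothetical names, and a real system would bound how long IDs are remembered instead of keeping them forever.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, for illustration only: drops messages whose ID
// attribute has already been seen by this instance.
class IdDeduplicator {
    private final String idAttribute;
    // Unbounded here for simplicity; this is an assumption of the sketch.
    private final Set<String> seen = new HashSet<>();

    IdDeduplicator(String idAttribute) {
        this.idAttribute = idAttribute;
    }

    // Returns true the first time an ID is seen and false for repeats.
    // Messages that lack the attribute are passed through.
    boolean accept(Map<String, String> attributes) {
        String id = attributes.get(idAttribute);
        if (id == null) {
            return true;
        }
        return seen.add(id);
    }

    public static void main(String[] args) {
        IdDeduplicator dedup = new IdDeduplicator("uuid");
        System.out.println(dedup.accept(Map.of("uuid", "a"))); // first delivery
        System.out.println(dedup.accept(Map.of("uuid", "a"))); // redelivery
    }
}
```

In this sketch the first call with a given "uuid" is accepted and a later call with the same value is rejected, which is the behaviour I expect from the job.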

This works perfectly until I drain an existing job, wait for it to finish, and then start a new one. At that point I see millions of duplicate records in BigQuery (my job writes the Pub/Sub messages to BigQuery).

Any idea what I'm doing wrong?

Answers (1)