dbustosp

Reputation: 4458

Recover PubSub Acked messages from Dataflow after a region loss

I have been reading about how Dataflow acks messages when reading data in streaming. Based on the answers here and here, it seems Dataflow acks messages by bundle: once it finishes processing a bundle, it acks the messages in it.

The confusion is about what happens when a GroupByKey is involved in the pipeline. The data in the bundle will be persisted to a multi-regional bucket and the messages will be acknowledged. Now imagine the whole region goes down. The intermediate data will still be in the bucket (because it is multi-regional).

That being said,

  1. What are the steps to follow in order not to lose any data?
  2. Any recommendations on how to handle an active/active approach so that no data is lost when a region is completely down?

Please advise,

Upvotes: 2

Views: 842

Answers (1)

Jeff Klukas

Reputation: 1357

With Dataflow and the current implementation of PubSubIO, achieving at-least-once delivery depends on checkpointed state being available. You must always drain your pipeline when cancelling; otherwise, checkpointed state may be lost. If a whole region became unavailable and you needed to start up the job in another region, I believe this would be equivalent to having the pipeline cancelled without draining.
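For reference, draining (rather than cancelling) can be triggered from the Cloud Console or with the gcloud CLI; the job ID, region, and project below are placeholders:

    gcloud dataflow jobs drain JOB_ID --region=us-central1 --project=my-project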

We have several simple streaming Dataflow pipelines that read from PubSub and write to PubSub without ever invoking a GroupByKey, so no checkpoint state is involved and messages are only ack'd after being delivered to the output topic.
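A minimal sketch of such a passthrough pipeline in the Beam Java SDK might look like the following (the project, subscription, and topic names are placeholders):

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class PubsubPassthrough {
      public static void main(String[] args) {
        Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // No GroupByKey anywhere, so there is no checkpointed state;
        // messages are acked only after delivery to the output topic.
        pipeline
            .apply("ReadFromPubsub", PubsubIO.readStrings()
                .fromSubscription("projects/my-project/subscriptions/input-sub"))
            .apply("WriteToPubsub", PubsubIO.writeStrings()
                .to("projects/my-project/topics/output-topic"));

        pipeline.run();
      }
    }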

We have other pipelines that read from Pub/Sub and write to GCS or BigQuery. FileIO and BigQueryIO both include several GroupByKey operations, so we are vulnerable to data loss if checkpointed messages are dropped. We have had several occasions where these pipelines got into an unrecoverable state that required cancelling. In those scenarios, we had to backfill a portion of data from an earlier stage of our data architecture.
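As an illustration of where that GroupByKey comes from, here is a rough sketch of a Pub/Sub-to-GCS pipeline; the bucket, subscription, window size, and shard count are arbitrary placeholders, and TextIO, like FileIO, expands to GroupByKey operations inside its write transform:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.joda.time.Duration;

    public class PubsubToGcs {
      public static void main(String[] args) {
        Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        pipeline
            .apply("ReadFromPubsub", PubsubIO.readStrings()
                .fromSubscription("projects/my-project/subscriptions/input-sub"))
            // Windowing lets the file sink group and finalize output; the
            // GroupByKeys inside the sink hold the checkpointed state that
            // can be lost if the job is cancelled without draining.
            .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(5))))
            .apply("WriteToGcs", TextIO.write()
                .to("gs://my-bucket/output/part")
                .withWindowedWrites()
                .withNumShards(10));

        pipeline.run();
      }
    }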

At this point, Beam does not offer a solution for delaying acks of Pubsub messages across a GroupByKey, so you need to either accept that risk and build operational workflows that can recover from lost checkpointed state or work around the issue by sinking messages to a different data store outside of Beam.
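One simple form of that workaround is to attach an additional subscription to the same topic and archive the raw messages with a consumer that is independent of the Beam job, so a backfill source always exists regardless of what happens to the pipeline's checkpointed state; the names below are placeholders:

    gcloud pubsub subscriptions create archive-sub --topic=input-topic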

Upvotes: 5
