Reputation: 2419
Following the recent incidents in which an entire availability zone could have been lost to an outage, I would like to better understand Dataflow's failover procedures.
When I manually deleted the worker nodes of a Dataflow job (streaming, Pub/Sub to BigQuery), they were successfully recreated and restarted, yet the Dataflow pipeline itself did not recover.
Even though all statuses reported OK, data items were not flowing.
The only way to restart the flow was to cancel the job and resubmit it.
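For context, the job is roughly of the following shape (a minimal sketch using the Apache Beam Python SDK; the topic, table, and schema names below are placeholders, not the actual pipeline):

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder resource names -- substitute real project/topic/table values.
TOPIC = "projects/my-project/topics/my-topic"
TABLE = "my-project:my_dataset.my_table"

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (
        p
        # Read messages from the Pub/Sub topic as a streaming source.
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(topic=TOPIC)
        # Decode each message into a row dict matching the BigQuery schema.
        | "Decode" >> beam.Map(lambda msg: {"payload": msg.decode("utf-8")})
        # Append rows to the BigQuery table.
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            TABLE,
            schema="payload:STRING",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
```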
Even though I understand that manual deletion is not a valid test, we cannot discount the factor of human error.
My understanding is that the pipeline should have recovered automatically, yet that is not what I observed here.
What am I missing?
Upvotes: 1
Views: 194
Reputation: 6023
Dataflow relies on GCE for resilience to physical failures, so we do not support recovery from the manual deletion of a node. Explicit deletion does not simulate a GCE outage and therefore will not test the resiliency property you are interested in.
Upvotes: 2