Reputation: 909
When a Dataflow streaming job with autoscaling enabled is deployed, it starts with a single worker. Let's assume the pipeline reads Pub/Sub messages, applies some DoFn operations, and uploads the results to BigQuery. Let's also assume the Pub/Sub subscription already has a sizeable backlog. So the pipeline starts and pulls some Pub/Sub messages, processing them on the single worker. After a couple of minutes Dataflow realizes that extra workers are needed and creates them. At that point many Pub/Sub messages have already been pulled and are being processed, but not acked yet. And here is my question: how will Dataflow manage those elements that are still being processed but not yet acked?
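For concreteness, here is a rough sketch of the kind of pipeline I mean (Python SDK; the project, subscription, and table names are just placeholders, not the real ones):

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class ParseAndTransform(beam.DoFn):
    """Placeholder for the DoFn work done on each Pub/Sub message."""
    def process(self, message):
        # `message` is the raw bytes payload read from Pub/Sub.
        yield {"payload": message.decode("utf-8")}


# Streaming job; autoscaling is enabled through the Dataflow job options.
options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (p
     | "ReadPubSub" >> beam.io.ReadFromPubSub(
         subscription="projects/my-project/subscriptions/my-sub")
     | "Transform" >> beam.ParDo(ParseAndTransform())
     | "WriteBQ" >> beam.io.WriteToBigQuery(
         "my-project:my_dataset.my_table",
         write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))
```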
My observations suggest that Dataflow sends many of those already-in-flight messages to a newly created worker, and we can see the same element being processed on two workers at the same time. Is this expected behavior?
Another question: what happens next? Does the first worker win, or the new one? I mean, we have the same Pub/Sub message still being processed on the first worker and on the new one. What if the first worker is faster and finishes processing first? Will its result be acked and sent downstream, or will it be dropped because a new process for this element has started and only the new one can be finalized?
Upvotes: 0
Views: 47
Reputation: 171
Dataflow provides exactly-once processing of every record. Funnily enough, this does not mean that user code is run only once per record, whether by the streaming or batch runner.
It might run a given record through a user transform multiple times, or it might even run the same record simultaneously on multiple workers; this is necessary to guarantee at-least-once processing in the face of worker failures. Only one of these invocations can “win” and produce output further down the pipeline.
More information here - https://cloud.google.com/blog/products/data-analytics/after-lambda-exactly-once-processing-in-google-cloud-dataflow-part-1
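To make that concrete, here is a minimal sketch (Python SDK assumed; names are illustrative): any side effect inside a DoFn can fire more than once for the same element, while only one invocation's output is committed downstream.

```python
import logging

import apache_beam as beam


class Enrich(beam.DoFn):
    def process(self, element):
        # This log line (or any external call made here) can appear multiple
        # times for the same element: once per retried or duplicate invocation,
        # possibly on different workers.
        logging.info("processing %s", element)

        # Only one invocation's output is committed to the next stage; the
        # others are discarded by the runner, so downstream transforms still
        # see the element exactly once.
        yield element
```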
Upvotes: 3