Reputation: 909
When a Dataflow streaming job with autoscaling enabled is deployed, it starts with a single worker. Let's assume the pipeline reads Pub/Sub messages, applies some DoFn operations, and uploads the results to BigQuery. Let's also assume the Pub/Sub subscription already has a sizeable backlog. So the pipeline starts and pulls some Pub/Sub messages, processing them on the single worker. After a couple of minutes Dataflow realizes that extra workers are needed and creates them. At that point many Pub/Sub messages have already been pulled and are being processed, but not acked yet. And here is my question: how will Dataflow manage those elements that are still being processed but not yet acked?
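For concreteness, here is a rough sketch of the kind of pipeline I mean (Python SDK; the project, subscription, and table names are just placeholders, not the real ones):

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class ParseAndTransform(beam.DoFn):
    """Placeholder for the DoFn work done on each Pub/Sub message."""
    def process(self, message):
        # `message` is the raw bytes payload read from Pub/Sub.
        yield {"payload": message.decode("utf-8")}


# Streaming job; autoscaling is enabled through the Dataflow job options.
options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (p
     | "ReadPubSub" >> beam.io.ReadFromPubSub(
         subscription="projects/my-project/subscriptions/my-sub")
     | "Transform" >> beam.ParDo(ParseAndTransform())
     | "WriteBQ" >> beam.io.WriteToBigQuery(
         "my-project:my_dataset.my_table",
         write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))
```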
My observations suggest that Dataflow sends many of those already-in-flight messages to a newly created worker, and we can see the same element being processed on two workers at the same time. Is this expected behavior?
Another question: what happens next? Does the first worker win, or the new one? I mean, we have the same Pub/Sub message still being processed on the first worker and on the new one. What if the first worker is faster and finishes processing first? Will its result be acked and sent downstream, or will it be dropped because a new process for this element has started and only the new one can be finalized?
Upvotes: 0
Views: 47
Reputation: 171
Dataflow provides exactly-once processing of every record. Funnily enough, this does not mean that user code is run only once per record, whether by the streaming or batch runner.
It might run a given record through a user transform multiple times, or it might even run the same record simultaneously on multiple workers; this is necessary to guarantee at-least-once processing in the face of worker failures. Only one of these invocations can “win” and produce output further down the pipeline.
More information here - https://cloud.google.com/blog/products/data-analytics/after-lambda-exactly-once-processing-in-google-cloud-dataflow-part-1
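To make that concrete, here is a minimal sketch (Python SDK assumed; names are illustrative): any side effect inside a DoFn can fire more than once for the same element, while only one invocation's output is committed downstream.

```python
import logging

import apache_beam as beam


class Enrich(beam.DoFn):
    def process(self, element):
        # This log line (or any external call made here) can appear multiple
        # times for the same element: once per retried or duplicate invocation,
        # possibly on different workers.
        logging.info("processing %s", element)

        # Only one invocation's output is committed to the next stage; the
        # others are discarded by the runner, so downstream transforms still
        # see the element exactly once.
        yield element
```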
Upvotes: 3