Reputation: 4191
I have a pipeline that comprises three steps. The first step is a ParDo that accepts 5 URLs in a PCollection; each of the 5 items generates thousands of URLs and outputs them. So the input of the second step is another PCollection, which can be of size 100-400k. In the last step the scraped output of each URL is saved to a storage service.
I have noticed that the first step, which generates the URL list out of the 5 input URLs, got allocated 5 workers and generated a new set of URLs. But once the first step completed, the number of workers dropped to 1, and the second step is running on only 1 worker (with 1 worker my Dataflow job has been running for the last 2 days, so by looking at the logs I am making the logical assumption that the first step is completed).
So my question is: even though the size of the PCollection is big, why is it not split between workers, or why are more workers not getting allocated? Step 2 is a simple web scraper that scrapes the given URL and outputs a string, which is then saved to a storage service.
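For context, here is a minimal sketch of roughly what the pipeline looks like (transform and class names are illustrative, not my actual code):

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

public class ScrapePipeline {

  // Step 1: fans out each seed URL into many URLs (placeholder logic).
  static class GenerateUrlsFn extends DoFn<String, String> {
    @ProcessElement
    public void processElement(@Element String seedUrl, OutputReceiver<String> out) {
      // The real job emits thousands of URLs per seed.
      for (int i = 0; i < 1000; i++) {
        out.output(seedUrl + "/page/" + i);
      }
    }
  }

  // Step 2: scrapes a single URL and outputs its content as a string (placeholder logic).
  static class ScrapeUrlFn extends DoFn<String, String> {
    @ProcessElement
    public void processElement(@Element String url, OutputReceiver<String> out) {
      out.output("scraped content of " + url);
    }
  }

  // Step 3: writes each scraped string to a storage service (placeholder logic).
  static class SaveToStorageFn extends DoFn<String, Void> {
    @ProcessElement
    public void processElement(@Element String content) {
      // Call to the storage service goes here.
    }
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply(Create.of("url1", "url2", "url3", "url4", "url5")) // the 5 seed URLs
        .apply("GenerateUrls", ParDo.of(new GenerateUrlsFn()))
        .apply("Scrape", ParDo.of(new ScrapeUrlFn()))
        .apply("Save", ParDo.of(new SaveToStorageFn()));

    p.run();
  }
}
```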
Upvotes: 0
Views: 48
Reputation: 2024
Dataflow tries to connect steps together to create fused steps. So even though you have a few ParDos in your pipeline, they'll be fused together and executed as a single step.
Also, once fused, Dataflow's scaling is limited by the step at the beginning of the fused step.
I suspect you have a Create transform that consists of a few elements at the top of your pipeline. In this case Dataflow can only scale up to the number of elements in this Create transform.
One way to prevent this behavior is to break fusion after one (or more) of your high fan-out ParDo transforms. This can be done by adding a Reshuffle.viaRandomKey() transform after it (which contains a GroupByKey). Given that Reshuffle is an identity transform, your pipeline should not require additional changes.
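For example, a minimal sketch of where the Reshuffle would go, reusing the illustrative transform names from the sketch in the question (assumed names, not your actual code):

```java
import org.apache.beam.sdk.transforms.Reshuffle;

// Same pipeline as the question's sketch, with fusion broken after the fan-out step.
p.apply(Create.of("url1", "url2", "url3", "url4", "url5"))
    .apply("GenerateUrls", ParDo.of(new GenerateUrlsFn())) // high fan-out ParDo
    .apply(Reshuffle.viaRandomKey())                       // breaks fusion so later steps can scale
    .apply("Scrape", ParDo.of(new ScrapeUrlFn()))
    .apply("Save", ParDo.of(new SaveToStorageFn()));
```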
See here for more information regarding fusion and ways to prevent it.
Upvotes: 2