Reputation: 4191
I have a pipeline that comprises three steps. The first step is a ParDo that accepts 5 URLs in a PCollection; each of the 5 items generates thousands of URLs and outputs them. So the input of the second step is another PCollection, which can be of size 100-400k. In the last step the scraped output of each URL is saved to a storage service.
I have noticed that the first step, which generates the URL list out of the 5 input URLs, got allocated 5 workers and generated a new set of URLs. But once the first step completed, the number of workers dropped to 1, and the second step is running on only 1 worker (with 1 worker my Dataflow job has been running for the last 2 days, so by looking at the logs I am making the logical assumption that the first step is completed).
So my question is: even though the size of the PCollection is big, why is it not split between workers, or why are more workers not getting allocated? Step 2 is a simple web scraper that scrapes the given URL and outputs a string, which is then saved to a storage service.
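For context, here is a minimal sketch of roughly what the pipeline looks like (transform and class names are illustrative, not my actual code):

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

public class ScrapePipeline {

  // Step 1: fans out each seed URL into many URLs (placeholder logic).
  static class GenerateUrlsFn extends DoFn<String, String> {
    @ProcessElement
    public void processElement(@Element String seedUrl, OutputReceiver<String> out) {
      // The real job emits thousands of URLs per seed.
      for (int i = 0; i < 1000; i++) {
        out.output(seedUrl + "/page/" + i);
      }
    }
  }

  // Step 2: scrapes a single URL and outputs its content as a string (placeholder logic).
  static class ScrapeUrlFn extends DoFn<String, String> {
    @ProcessElement
    public void processElement(@Element String url, OutputReceiver<String> out) {
      out.output("scraped content of " + url);
    }
  }

  // Step 3: writes each scraped string to a storage service (placeholder logic).
  static class SaveToStorageFn extends DoFn<String, Void> {
    @ProcessElement
    public void processElement(@Element String content) {
      // Call to the storage service goes here.
    }
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply(Create.of("url1", "url2", "url3", "url4", "url5")) // the 5 seed URLs
        .apply("GenerateUrls", ParDo.of(new GenerateUrlsFn()))
        .apply("Scrape", ParDo.of(new ScrapeUrlFn()))
        .apply("Save", ParDo.of(new SaveToStorageFn()));

    p.run();
  }
}
```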
Upvotes: 0
Views: 48
Reputation: 2024
Dataflow tries to connect steps together to create fused steps. So even though you have a few ParDos in your pipeline, they'll be fused together and executed as a single step.
Also, once fused, Dataflow's scaling is limited by the step at the beginning of the fused step.
I suspect you have a Create transform that consists of a few elements at the top of your pipeline. In this case Dataflow can only scale up to the number of elements in this Create transform.
One way to prevent this behavior is to break fusion after one (or more) of your high fan-out ParDo transforms. This can be done by adding a Reshuffle.viaRandomKey() transform after it (which contains a GroupByKey). Given that Reshuffle is an identity transform, your pipeline should not require additional changes.
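For example, a minimal sketch of where the Reshuffle would go, reusing the illustrative transform names from the sketch in the question (assumed names, not your actual code):

```java
import org.apache.beam.sdk.transforms.Reshuffle;

// Same pipeline as the question's sketch, with fusion broken after the fan-out step.
p.apply(Create.of("url1", "url2", "url3", "url4", "url5"))
    .apply("GenerateUrls", ParDo.of(new GenerateUrlsFn())) // high fan-out ParDo
    .apply(Reshuffle.viaRandomKey())                       // breaks fusion so later steps can scale
    .apply("Scrape", ParDo.of(new ScrapeUrlFn()))
    .apply("Save", ParDo.of(new SaveToStorageFn()));
```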
See here for more information regarding fusion and ways to prevent it.
Upvotes: 2