Reputation: 66
I have a Cloud Dataflow pipeline that looks like this:
Initially, without setting the maximum or initial number of workers, it works fine but takes a long time to process large datasets. Then I set maxNumWorkers to 60 and numWorkers to 6; the job ran, but we lost a lot of data at the end of processing.
We also tried this:
--autoscaling_algorithm=THROUGHPUT_BASED --max_num_workers=5
Still, the job starts with one worker and does not scale automatically.
It does not seem that Dataflow is spinning up workers and balancing the load automatically.
Upvotes: 0
Views: 437
Reputation: 3883
I suggest enabling the Dataflow Streaming Engine feature, since it provides more responsive autoscaling, based on CPU utilization, for your pipeline than the default architecture for Dataflow worker processing and autoscaling.
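As a rough illustration, the flags from the question can be combined with the Streaming Engine switch in one place. This is a minimal sketch, not an official recipe: the helper name `dataflow_scaling_args` is made up for this answer, it assumes the Python SDK flag spellings (`--max_num_workers`, `--num_workers`, `--enable_streaming_engine`), and Streaming Engine only applies if your job is a streaming pipeline.

```python
def dataflow_scaling_args(max_num_workers, num_workers=None, streaming_engine=False):
    """Build autoscaling-related Dataflow flags (hypothetical helper).

    Only assembles a list of command-line flags; it does not launch a job.
    """
    if max_num_workers < 1:
        raise ValueError("max_num_workers must be at least 1")
    if num_workers is not None and not 1 <= num_workers <= max_num_workers:
        raise ValueError("num_workers must be between 1 and max_num_workers")

    args = [
        # Let the service scale between the initial and maximum worker count.
        "--autoscaling_algorithm=THROUGHPUT_BASED",
        f"--max_num_workers={max_num_workers}",
    ]
    if num_workers is not None:
        # Initial number of workers the job starts with.
        args.append(f"--num_workers={num_workers}")
    if streaming_engine:
        # Assumption: streaming job on the Python SDK.
        args.append("--enable_streaming_engine")
    return args


if __name__ == "__main__":
    # Values taken from the question: start with 6 workers, cap at 60.
    print(" ".join(dataflow_scaling_args(60, num_workers=6, streaming_engine=True)))
```

The resulting list can be appended to the arguments you already pass when launching the pipeline, for example to `PipelineOptions` in the Beam Python SDK.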
There is a known issue related to the throughput and input behavior of Cloud Dataflow. You can track the improvements here. Please click +1 to make it more visible to the Dataflow engineering team.
Additionally, check whether there is a quota issue on the relevant resources. For each job, Dataflow creates an instance group. The worker VMs are started via the instance group, and each worker VM consumes resources. All of these resources (e.g. instance groups, IP addresses, CPUs) can be restricted by quotas, so a job may be unable to scale up even when max workers is set higher. Follow the documentation. I also found a similar SO thread with an answer from a Dataflow engineer.
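To make the quota point concrete, here is a small back-of-the-envelope sketch of how a quota can cap the worker count below the maximum you requested. The function name and all numbers are made up for illustration; the per-worker figures depend on your machine type and network settings, so read the real values from your project's quota page.

```python
def max_workers_within_quota(remaining_quota, per_worker_usage):
    """Return how many worker VMs fit in the remaining quota (hypothetical helper).

    remaining_quota:  dict of resource name -> units still available
    per_worker_usage: dict of resource name -> units one worker VM consumes
    """
    limits = []
    for resource, per_worker in per_worker_usage.items():
        if per_worker <= 0:
            continue  # this resource does not constrain the worker count
        # Whole workers that fit within this single resource's quota.
        limits.append(remaining_quota[resource] // per_worker)
    # The tightest resource decides the overall limit.
    return min(limits) if limits else 0


if __name__ == "__main__":
    # Illustrative numbers only, not real defaults.
    remaining = {"CPUS": 24, "IN_USE_ADDRESSES": 8}
    per_worker = {"CPUS": 4, "IN_USE_ADDRESSES": 1}
    print(max_workers_within_quota(remaining, per_worker))
```

With these sample numbers the CPU quota is the bottleneck: 24 remaining CPUs at 4 per worker allow only 6 workers, even though 8 IP addresses are free.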
I hope you find the above pieces of information useful.
Upvotes: 3