Reputation: 149
In Apache Beam / Dataflow, I'm curious as to the relationship between number of cores across the entire job as it pertains to the number of worker VMs.
Specifically, when would a job benefit from "going wide" (many workers with a lower core count per worker) versus "going big" (fewer workers with a higher core count per worker)? Does this affect parallelism?
Example: 8 64-core workers vs. 128 4-core workers. Both configurations give the job 512 cores. It's unclear to me which I should opt for, especially now that the Shuffle service allows us to deploy workers with much smaller disks.
Thanks
Upvotes: 0
Views: 184
Reputation: 5104
Dataflow considers parallelism in terms of the number of cores, not the number of machines, so from its point of view 8 64-core machines and 128 4-core machines are the same. Larger machines can benefit some pipelines: workers can "share" common in-memory data structures such as a large ML model, per-worker startup overhead is slightly reduced, and for very large jobs there is by default a limit of 1,000 machines regardless of their size. For most jobs, though, it doesn't really matter.
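For illustration, here is a minimal sketch of how the two shapes you mention could be requested with the Beam Python SDK's standard Dataflow worker options; the machine type names are just examples, and any other pipeline options (project, region, temp location, etc.) are omitted.

    from apache_beam.options.pipeline_options import PipelineOptions

    # "Going big": 8 workers x 64 vCPUs = 512 cores
    big_workers = PipelineOptions(
        runner="DataflowRunner",
        machine_type="n1-standard-64",  # 64 vCPUs per worker
        num_workers=8,
        max_num_workers=8,
    )

    # "Going wide": 128 workers x 4 vCPUs = 512 cores
    wide_workers = PipelineOptions(
        runner="DataflowRunner",
        machine_type="n1-standard-4",   # 4 vCPUs per worker
        num_workers=128,
        max_num_workers=128,
    )

Either way the service sees 512 cores of parallelism; the choice mostly affects the per-worker considerations mentioned above.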
Upvotes: 1