Reputation: 478
I have a Cloud Dataflow pipeline that reads from Bigtable, transforms the data, and then partitions and writes the results.
Initially, without setting any max workers, Dataflow autoscaling would scale up to the maximum (1000 workers) and put a LOT of stress on our Bigtable cluster. I then set maxNumWorkers to 100, which keeps the load on Bigtable reasonable, and Stage 1 (reading from Bigtable) still finishes quickly; but Stages 2 and 3 take significantly longer with only 100 workers.

Is there any way I can change maxNumWorkers dynamically after the first stage? I see apply(Wait.on) but I'm not sure how to utilize it. My Beam job looks like this:
pipeline
.apply("Read Bigtable",...)
.apply("Transform", ...)
.apply("Partition&Write", ...)
I am looking for a way to wait for .apply("Read Bigtable",...) to finish and then increase maxNumWorkers. Essentially, my first stage is I/O-bound and doesn't need many workers (CPU), but my later stages are CPU-bound and need more of them.
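For what it's worth, my understanding of how Wait.on would be wired up is below. This is only a sketch: the class name is made up and Create.of stands in for the real Bigtable read. As far as I can tell it only sequences the stages inside the same job, so it doesn't change maxNumWorkers.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.Wait;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

public class WaitOnSketch {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Stand-ins for the real sources; the actual job would read these from Bigtable.
    PCollection<String> stageOneOutput = pipeline.apply("Stage 1", Create.of("done"));
    PCollection<String> stageTwoInput = pipeline.apply("Stage 2 input", Create.of("a", "b"));

    // Wait.on holds back processing of stageTwoInput until stageOneOutput is complete,
    // but both stages still run inside the same job, so the job-level maxNumWorkers
    // value stays whatever it was at submission time.
    stageTwoInput
        .apply(Wait.on(stageOneOutput))
        .apply("Transform",
            MapElements.into(TypeDescriptors.strings()).via((String s) -> s.toUpperCase()));

    pipeline.run();
  }
}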
Upvotes: 3
Views: 335
Reputation: 173
Did you try using file sharding to control the parallelism?

1) Keep maxNumWorkers at 1000.
2) Right after reading from Bigtable, write the data out with a sharding of 100.
3) Load the data again, do the processing, and write the final results with a sharding of 1000.

I can't guarantee it will work, but it's worth a try.
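Roughly what I have in mind, as a sketch only: the bucket paths and class name are made up, Create.of stands in for the real BigtableIO.read(), and TextIO plus the matchAll/readFiles pattern is just one way I'd try to write and re-read the files inside the same pipeline.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.Values;
import org.apache.beam.sdk.values.TypeDescriptors;

public class ShardedHandoffSketch {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    pipeline
        // 1) Stand-in for BigtableIO.read() plus whatever formats rows as strings.
        .apply("Read Bigtable", Create.of("row1", "row2"))

        // 2) Persist the raw rows to 100 shards. getPerDestinationOutputFilenames()
        //    only emits once the files are written, so everything chained after it
        //    runs after the intermediate write has finished.
        .apply("Write intermediate",
            TextIO.write()
                .to("gs://my-bucket/tmp/handoff")   // hypothetical path
                .withSuffix(".txt")
                .withNumShards(100)
                .withOutputFilenames())
        .getPerDestinationOutputFilenames()

        // 3) Load the files back, do the CPU-heavy work, and write the final
        //    results with a higher shard count.
        .apply(Values.<String>create())
        .apply(FileIO.matchAll())
        .apply(FileIO.readMatches())
        .apply("Reload", TextIO.readFiles())
        .apply("Transform",
            MapElements.into(TypeDescriptors.strings()).via((String s) -> s.toUpperCase()))
        .apply("Partition&Write",
            TextIO.write().to("gs://my-bucket/output/part").withNumShards(1000));

    pipeline.run();
  }
}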
Upvotes: -1