Reputation: 478
I have a Cloud Dataflow pipeline that reads from Bigtable, transforms the data, and then partitions and writes the results.
Initially, without setting any max workers, Dataflow autoscaling would scale up to the maximum (1000 workers) and put a LOT of stress on our Bigtable cluster. I then set maxNumWorkers to 100, which keeps the load on Bigtable reasonable, and Stage 1 (reading from Bigtable) still finishes quickly; but Stages 2 and 3 take significantly longer with only 100 workers.

Is there any way I can change maxNumWorkers dynamically after the first stage? I see apply(Wait.on) but I'm not sure how to utilize it. My Beam job looks like this:
pipeline
.apply("Read Bigtable",...)
.apply("Transform", ...)
.apply("Partition&Write", ...)
I am looking for a way to wait for .apply("Read Bigtable",...) to finish and then increase maxNumWorkers. Essentially, my first stage is I/O-bound and doesn't need many workers (CPU), but my later stages are CPU-bound and need more of them.
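For what it's worth, my understanding of how Wait.on would be wired up is below. This is only a sketch: the class name is made up and Create.of stands in for the real Bigtable read. As far as I can tell it only sequences the stages inside the same job, so it doesn't change maxNumWorkers.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.Wait;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

public class WaitOnSketch {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Stand-ins for the real sources; the actual job would read these from Bigtable.
    PCollection<String> stageOneOutput = pipeline.apply("Stage 1", Create.of("done"));
    PCollection<String> stageTwoInput = pipeline.apply("Stage 2 input", Create.of("a", "b"));

    // Wait.on holds back processing of stageTwoInput until stageOneOutput is complete,
    // but both stages still run inside the same job, so the job-level maxNumWorkers
    // value stays whatever it was at submission time.
    stageTwoInput
        .apply(Wait.on(stageOneOutput))
        .apply("Transform",
            MapElements.into(TypeDescriptors.strings()).via((String s) -> s.toUpperCase()));

    pipeline.run();
  }
}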
Upvotes: 3
Views: 335
Reputation: 173
Did you try using file sharding to control the parallelism?

1) Keep maxNumWorkers at 1000.
2) Right after reading from Bigtable, write the data out with a sharding of 100.
3) Load the data again, do the processing, and write the final results with a sharding of 1000.

I can't guarantee it will work, but it's worth a try.
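Roughly what I have in mind, as a sketch only: the bucket paths and class name are made up, Create.of stands in for the real BigtableIO.read(), and TextIO plus the matchAll/readFiles pattern is just one way I'd try to write and re-read the files inside the same pipeline.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.Values;
import org.apache.beam.sdk.values.TypeDescriptors;

public class ShardedHandoffSketch {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    pipeline
        // 1) Stand-in for BigtableIO.read() plus whatever formats rows as strings.
        .apply("Read Bigtable", Create.of("row1", "row2"))

        // 2) Persist the raw rows to 100 shards. getPerDestinationOutputFilenames()
        //    only emits once the files are written, so everything chained after it
        //    runs after the intermediate write has finished.
        .apply("Write intermediate",
            TextIO.write()
                .to("gs://my-bucket/tmp/handoff")   // hypothetical path
                .withSuffix(".txt")
                .withNumShards(100)
                .withOutputFilenames())
        .getPerDestinationOutputFilenames()

        // 3) Load the files back, do the CPU-heavy work, and write the final
        //    results with a higher shard count.
        .apply(Values.<String>create())
        .apply(FileIO.matchAll())
        .apply(FileIO.readMatches())
        .apply("Reload", TextIO.readFiles())
        .apply("Transform",
            MapElements.into(TypeDescriptors.strings()).via((String s) -> s.toUpperCase()))
        .apply("Partition&Write",
            TextIO.write().to("gs://my-bucket/output/part").withNumShards(1000));

    pipeline.run();
  }
}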
Upvotes: -1