KrysztovCddsz

Reputation: 23

Why Google Cloud Dataflow doesn't scale to target workers with autoscaling enabled and no quota limitations

I am running a Dataflow pipeline to parse about 45,000 text files stored in a Cloud Storage bucket. The parsed text is transformed into JSON and written to text files for subsequent loading into BigQuery (not part of the pipeline). A few minutes after the pipeline starts, the number of target workers is raised to more than 30 (the exact number varies slightly between runs), but the number of actual workers remains stuck at 1.

Things I have checked: autoscaling is enabled for the job, and no quota limits are being hit.

If I let the pipeline run, it finishes successfully in about 2 hours, but I expect it could run much faster if the actual worker count scaled up to the target.

Here is the relevant section of the code:

import apache_beam as beam
from apache_beam.io import WriteToText
from google.cloud import storage

# List every blob name in the bucket to feed into the pipeline
client = storage.Client()
blobs = client.list_blobs(bucket_name)
rf = [b.name for b in blobs]

with beam.Pipeline(options=pipeline_options) as p:
    json_list = (p | 'Create filelist' >> beam.Create(rf)
                   | 'Get string' >> beam.Map(getstring)
                   | 'Filter empty strings' >> beam.Filter(lambda x: x != "")
                   | 'Get JSON' >> beam.Map(getjson)
                   | 'Write output' >> WriteToText(known_args.output))

Any suggestions as to what is preventing the workers from scaling up?

Upvotes: 2

Views: 1283

Answers (1)

danielm

Reputation: 3010

The issue here is that there is no parallelism available in this pipeline. The Create transform is single-sharded, and everything else in the pipeline gets fused together with it, so all of the work ends up on one worker. Using one of the built-in file-reading transforms such as ReadFromText will solve this; alternatively, you can put a Reshuffle transform after the Create to break the pipeline into two separate stages, as sketched below.
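
As a minimal sketch, the Reshuffle approach applied to the pipeline from the question might look like this (getstring, getjson, pipeline_options, rf, and known_args.output are the question's own names; the step label 'Reshuffle' is illustrative):

with beam.Pipeline(options=pipeline_options) as p:
    json_list = (p | 'Create filelist' >> beam.Create(rf)
                   # Reshuffle materializes the file list and breaks fusion with the
                   # single-sharded Create, so the steps below can be distributed
                   # across many workers instead of running on one.
                   | 'Reshuffle' >> beam.Reshuffle()
                   | 'Get string' >> beam.Map(getstring)
                   | 'Filter empty strings' >> beam.Filter(lambda x: x != "")
                   | 'Get JSON' >> beam.Map(getjson)
                   | 'Write output' >> WriteToText(known_args.output))

If you would rather let Beam read the files itself, a ReadFromText over a glob pattern (e.g. 'gs://<bucket>/*') could replace the Create/getstring pair, but note that it emits one element per line rather than one element per file, so the downstream getjson step would have to account for that.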

Upvotes: 2
