Reputation: 23
I am running a Dataflow pipeline to parse about 45,000 text files stored in a Cloud Storage bucket. The parsed text is transformed into JSON and written to text files for subsequent loading into BigQuery (not part of the pipeline). A few minutes after the pipeline starts, the number of target workers is raised to more than 30 (the exact number varies slightly between runs), but the number of actual workers stays stuck at 1.
Things I have checked:
If I let the pipeline run, it finishes successfully in about 2 hours, but I expect it could run much faster if the actual workers scaled up to the target.
Here is the relevant section of the code:
import apache_beam as beam
from apache_beam.io import WriteToText
from google.cloud import storage

# Build the list of input file names outside the pipeline
client = storage.Client()
blobs = client.list_blobs(bucket_name)
rf = [b.name for b in blobs]

with beam.Pipeline(options=pipeline_options) as p:
    json_list = (p | 'Create filelist' >> beam.Create(rf)
                   | 'Get string' >> beam.Map(getstring)
                   | 'Filter empty strings' >> beam.Filter(lambda x: x != "")
                   | 'Get JSON' >> beam.Map(getjson)
                   | 'Write output' >> WriteToText(known_args.output))
Any suggestions as to what is preventing the workers from scaling up?
Upvotes: 2
Views: 1283
Reputation: 3010
The issue here is that there is no parallelism available in this pipeline. The Create transform produces a single shard, and everything else in the pipeline is fused together with it, so all of the work ends up on one worker. Using one of the built-in file-reading transforms such as ReadFromText will solve this, or you can put a Reshuffle transform after the Create to break the pipeline into two separate stages.
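For illustration, a minimal sketch of the Reshuffle option, assuming the same rf file list, getstring/getjson helpers, pipeline_options and known_args as in the question; the only change from the original code is the added Reshuffle step after Create:

import apache_beam as beam
from apache_beam.io import WriteToText

with beam.Pipeline(options=pipeline_options) as p:
    json_list = (p | 'Create filelist' >> beam.Create(rf)
                   # Reshuffle breaks fusion: the steps below become a separate
                   # stage whose elements can be redistributed across workers.
                   | 'Reshuffle' >> beam.Reshuffle()
                   | 'Get string' >> beam.Map(getstring)
                   | 'Filter empty strings' >> beam.Filter(lambda x: x != "")
                   | 'Get JSON' >> beam.Map(getjson)
                   | 'Write output' >> WriteToText(known_args.output))

With the Reshuffle in place, Dataflow can spread the file names across many workers, so the Map and Filter steps run in parallel instead of being fused onto the single shard produced by Create.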
Upvotes: 2