Reputation: 41
I am using the Apache Beam 2.5.0 Python SDK.
In the pipeline below I read input from a Pub/Sub topic, parse it, and then want to perform some operations on the parsed data. When I run it with the DataflowRunner it works fine, but "data processing fun1", "data processing fun2", and "data processing fun3" appear to run sequentially, and I need them to run in parallel. I am new to Dataflow.
Is there a way to parallelize them?
import argparse

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# config, parse_pub_sub_message_with_bq_data, and Fun1/Fun2/Fun3
# are defined elsewhere in my project.

def run():
    parser = argparse.ArgumentParser()
    args, pipeline_args = parser.parse_known_args()
    options = PipelineOptions(pipeline_args)
    with beam.Pipeline(options=options) as p:
        data = (p
                | "Read Pubsub Messages" >> beam.io.ReadFromPubSub(topic=config.pub_sub_topic)
                | "Parse messages" >> beam.Map(parse_pub_sub_message_with_bq_data))
        data | "data processing fun1" >> beam.ParDo(Fun1())
        data | "data processing fun2" >> beam.ParDo(Fun2())
        data | "data processing fun3" >> beam.ParDo(Fun3())

if __name__ == '__main__':
    run()
Upvotes: 3
Views: 3608
Reputation: 11031
Why do you need these functions to run at the same time?
Beam / Dataflow takes your graph and tries to optimize steps so that they can run in the same thread. This is called fusion optimization, and it is described in the FlumeJava paper.
The point is that it is usually more efficient to run those functions one after another on the same thread than to move data between multiple processing threads or VMs just to parallelize the processing.
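As a rough way to observe this, here is a minimal sketch (the LogThread DoFn, its labels, and the in-memory Create source are my own illustration, not part of your pipeline) that logs which thread handles each element; when the runner fuses these steps, a given element is typically processed by the same thread in all three:

import logging
import threading

import apache_beam as beam


class LogThread(beam.DoFn):
    """Illustrative DoFn that reports the thread processing each element."""

    def __init__(self, label):
        self.label = label

    def process(self, element):
        # With fused stages, the same worker thread tends to show up
        # for a given element across all three steps.
        logging.info("%s handled %s on %s", self.label, element,
                     threading.current_thread().name)
        yield element


logging.getLogger().setLevel(logging.INFO)
with beam.Pipeline() as p:
    data = p | beam.Create([1, 2, 3])
    data | "fun1" >> beam.ParDo(LogThread("fun1"))
    data | "fun2" >> beam.ParDo(LogThread("fun2"))
    data | "fun3" >> beam.ParDo(LogThread("fun3"))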
If your functions must run more or less in parallel, you can add a beam.Reshuffle transform before each of the downstream transforms:
data = (p
        | beam.io.ReadFromPubSub(topic)
        | beam.Map(parse_messages))

# After the data has been reshuffled, it may be consumed by multiple workers.
data | beam.Reshuffle() | beam.ParDo(Fun1())
data | beam.Reshuffle() | beam.ParDo(Fun2())
data | beam.Reshuffle() | beam.ParDo(Fun3())
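Keep in mind that Reshuffle is not free: it materializes the elements and redistributes them across workers, which breaks fusion at the cost of an extra shuffle step. Whether the added parallelism pays for that cost depends on how heavy Fun1/Fun2/Fun3 are relative to your data volume.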
Let me know if I can add more detail to this.
Upvotes: 2