Reputation: 41
I am using the Apache Beam 2.5.0 Python SDK.
In the pipeline below I read input from a Pub/Sub topic, parse it, and then want to perform some operations on the parsed data. When I run it with the DataflowRunner it works fine, but "data processing fun1", "data processing fun2", and "data processing fun3" appear to run sequentially, and I need them to run in parallel. I am new to Dataflow.
Is there a way to parallelize them?
import argparse

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# config, parse_pub_sub_message_with_bq_data, and Fun1/Fun2/Fun3
# are defined elsewhere in my project.

def run():
    parser = argparse.ArgumentParser()
    args, pipeline_args = parser.parse_known_args()
    options = PipelineOptions(pipeline_args)
    with beam.Pipeline(options=options) as p:
        data = (p
                | "Read Pubsub Messages" >> beam.io.ReadFromPubSub(topic=config.pub_sub_topic)
                | "Parse messages" >> beam.Map(parse_pub_sub_message_with_bq_data))
        data | "data processing fun1" >> beam.ParDo(Fun1())
        data | "data processing fun2" >> beam.ParDo(Fun2())
        data | "data processing fun3" >> beam.ParDo(Fun3())

if __name__ == '__main__':
    run()
Upvotes: 3
Views: 3608
Reputation: 11031
Why do you need these functions to run at the same time?
Beam / Dataflow takes your graph and tries to optimize steps so that they can run in the same thread. This is called fusion optimization, and it is described in the FlumeJava paper.
The point is that it is usually more efficient to run those functions one after another on the same thread than to move data between multiple processing threads or VMs just to parallelize the processing.
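As a rough way to observe this, here is a minimal sketch (the LogThread DoFn, its labels, and the in-memory Create source are my own illustration, not part of your pipeline) that logs which thread handles each element; when the runner fuses these steps, a given element is typically processed by the same thread in all three:

import logging
import threading

import apache_beam as beam


class LogThread(beam.DoFn):
    """Illustrative DoFn that reports the thread processing each element."""

    def __init__(self, label):
        self.label = label

    def process(self, element):
        # With fused stages, the same worker thread tends to show up
        # for a given element across all three steps.
        logging.info("%s handled %s on %s", self.label, element,
                     threading.current_thread().name)
        yield element


logging.getLogger().setLevel(logging.INFO)
with beam.Pipeline() as p:
    data = p | beam.Create([1, 2, 3])
    data | "fun1" >> beam.ParDo(LogThread("fun1"))
    data | "fun2" >> beam.ParDo(LogThread("fun2"))
    data | "fun3" >> beam.ParDo(LogThread("fun3"))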
If your functions must run more or less in parallel, you can add a beam.Reshuffle transform before each of the downstream transforms:
data = (p
        | beam.io.ReadFromPubSub(topic)
        | beam.Map(parse_messages))

# After the data has been reshuffled, it may be consumed by multiple workers.
data | beam.Reshuffle() | beam.ParDo(Fun1())
data | beam.Reshuffle() | beam.ParDo(Fun2())
data | beam.Reshuffle() | beam.ParDo(Fun3())
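Keep in mind that Reshuffle is not free: it materializes the elements and redistributes them across workers, which breaks fusion at the cost of an extra shuffle step. Whether the added parallelism pays for that cost depends on how heavy Fun1/Fun2/Fun3 are relative to your data volume.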
Let me know if I can add more detail to this.
Upvotes: 2