Sam

Reputation: 33

How does multithreading work in an Apache Beam pipeline with bounded sources?

I am new to big data processing. I am using the Apache Beam Java SDK to work with it. I am trying to understand how multithreading/parallel data processing works in an Apache Beam pipeline, and how data is processed from one PTransform to another with respect to multithreading.

Upvotes: 3

Views: 1966

Answers (1)

Alexey Romanenko

Reputation: 1443

If you are talking about parallel data processing then, generally speaking, Beam relies on the data processing engine underneath (e.g. Spark, Flink, Dataflow, etc.). Usually you can't directly control things like how many workers will be used or how your input PCollection will be chunked and parallelised; that is the responsibility of the engine being used.

However, it's assumed that the input data will be split into bundles, and every instance of your DoFn in the pipeline will process one or more bundles on a worker, with different bundles potentially handled by several workers at the same time. In this way, data can be processed in parallel: every single bundle on every single worker. And there is no coordination or synchronisation mechanism among bundles (like we have with multithreading); we have to assume that they are processed independently and in arbitrary order.
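To make the bundle model concrete, here is a plain-Java sketch (no Beam dependency; the class name, bundle size, and `processElement` logic are all hypothetical, not Beam API). It mimics what a runner does: split the input into bundles and hand each bundle to a worker thread that processes its elements independently, with no coordination between bundles.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class BundleDemo {

    // Stand-in for a DoFn's per-element logic: square the element.
    static int processElement(int x) {
        return x * x;
    }

    public static void main(String[] args) throws Exception {
        List<Integer> input =
            IntStream.rangeClosed(1, 10).boxed().collect(Collectors.toList());

        // Split the input into bundles of 3 elements.
        // In Beam, the runner decides the bundle boundaries, not your code.
        int bundleSize = 3;
        List<List<Integer>> bundles = new ArrayList<>();
        for (int i = 0; i < input.size(); i += bundleSize) {
            bundles.add(input.subList(i, Math.min(i + bundleSize, input.size())));
        }

        // Each bundle is processed on its own worker thread; bundles do not
        // coordinate with each other and may finish in arbitrary order.
        ExecutorService workers = Executors.newFixedThreadPool(4);
        List<Future<List<Integer>>> futures = new ArrayList<>();
        for (List<Integer> bundle : bundles) {
            futures.add(workers.submit(() ->
                bundle.stream()
                      .map(BundleDemo::processElement)
                      .collect(Collectors.toList())));
        }

        // Collect the per-bundle results; a real runner would hand them
        // to the next PTransform in the pipeline.
        List<Integer> output = new ArrayList<>();
        for (Future<List<Integer>> f : futures) {
            output.addAll(f.get());
        }
        workers.shutdown();

        System.out.println(output);
    }
}
```

The key point the sketch illustrates: each bundle's elements are processed entirely within that bundle's task, so your DoFn logic must not rely on shared mutable state or on any particular ordering across bundles.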

So, this is a very bird's-eye view. If you have any specific questions, don't hesitate to ask.

Upvotes: 3
