Reputation: 863
I've been using Beam for some time now and I'd like to know the key concepts for writing efficient and optimized Beam pipelines.
I have a little Spark background, and I know that there we may prefer reduceByKey over groupByKey to avoid shuffling and reduce network traffic.
Is it the same case for Beam?
I'd appreciate some tips or materials/best practices.
Upvotes: 4
Views: 1925
Reputation: 1256
Some items to consider:
Filter first; place filter operations as high in the DAG as possible.
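As an illustration, here is a toy cost model in plain Python (the names `enrich` and `records` are illustrative, not Beam API): pushing the filter ahead of an expensive per-element step means every downstream stage only sees the surviving elements.

```python
# Count how many times the "expensive" per-element step runs.
calls = {"n": 0}

def enrich(x):
    calls["n"] += 1          # stand-in for an expensive transform
    return (x, x * x)

records = list(range(1000))
is_even = lambda x: x % 2 == 0

# Filter placed late in the "DAG": enrich everything, then drop half of it.
calls["n"] = 0
late = [e for e in map(enrich, records) if is_even(e[0])]
late_cost = calls["n"]       # 1000 expensive calls

# Filter first: the enrichment only sees surviving elements.
calls["n"] = 0
early = [enrich(x) for x in records if is_even(x)]
early_cost = calls["n"]      # 500 expensive calls

assert early == late and early_cost < late_cost
```

Same output, half the work in this toy example, and the saving compounds through every stage after the filter.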
Combine early; if there is a choice as to when to combine, do it as early as possible.
If possible, reduce the cost of large sliding windows by pre-aggregating into smaller fixed windows before the large sliding window: FixedWindow.of(1m) | Combine | SlidingWindow.of(6h)
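The effect can be sketched with a toy timeline in plain Python (the bucketing code illustrates the idea, not Beam's windowing API; a 3-minute sliding window is used instead of 6 hours to keep the numbers small). When the fixed-window size divides both the sliding window's size and its period, the pre-combined buckets yield exactly the same sliding-window sums while feeding the sliding stage far fewer elements.

```python
# (timestamp_seconds, value) events over ~10 minutes, one every 7s.
events = [(t, 1) for t in range(0, 600, 7)]

FIXED = 60               # 1-minute fixed pre-combine windows
SIZE, PERIOD = 180, 60   # 3-minute sliding window, sliding every minute

# Pre-combine: sum values into fixed 1-minute buckets.
buckets = {}
for t, v in events:
    buckets[t // FIXED] = buckets.get(t // FIXED, 0) + v

def sliding_from_raw(start):
    return sum(v for t, v in events if start <= t < start + SIZE)

def sliding_from_buckets(start):
    first = start // FIXED
    return sum(buckets.get(b, 0) for b in range(first, first + SIZE // FIXED))

# Every sliding window gets the same answer from the pre-combined buckets.
for start in range(0, 420, PERIOD):
    assert sliding_from_raw(start) == sliding_from_buckets(start)

# The sliding stage now consumes len(buckets) partial sums
# instead of len(events) raw elements.
assert len(buckets) < len(events)
```

With a real 6-hour sliding window sliding every minute, each element would otherwise be buffered into 360 overlapping windows, so the reduction is much larger than in this toy.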
Most runners will support graph fusion, which is the right thing 99% of the time. But in the case of a massive fan-out transform you should break fusion (for example with a Reshuffle) so the fanned-out work can be redistributed across workers.
For keys in general
Advanced key tip:
Compressed files can be read easily with an option flag; however, without an offset, TextIO cannot split the read across workers. If you have very large files to read, uncompressing them before starting the pipeline can provide a nice performance boost. Also look at splittable formats such as compressed Avro.
BackPressure: Beam runners are designed to rapidly chew through parallel work. They can spin up many threads across many machines to achieve this goal. This can easily swamp external systems, especially if you are making a per-element RPC call. If the external system cannot be made to scale, create batches using startBundle / finishBundle to reduce the number of invocations per second.
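A sketch of the batching pattern in plain Python (start_bundle / finish_bundle mirror Beam's real DoFn lifecycle method names, but this class and the rpc callable are toy stand-ins, not Beam API):

```python
class BatchingDoFn:
    """Buffers elements per bundle and sends them in batched RPC calls."""

    def __init__(self, rpc, batch_size=50):
        self.rpc = rpc              # hypothetical batched RPC client
        self.batch_size = batch_size
        self.buffer = []

    def start_bundle(self):
        self.buffer = []            # fresh buffer for each bundle

    def process(self, element):
        self.buffer.append(element)
        if len(self.buffer) >= self.batch_size:
            self.rpc(self.buffer)   # one call for the whole batch
            self.buffer = []

    def finish_bundle(self):
        if self.buffer:             # flush whatever is left in the bundle
            self.rpc(self.buffer)
            self.buffer = []

# Record each "RPC" so we can see how many calls were actually made.
calls = []
fn = BatchingDoFn(calls.append, batch_size=50)
fn.start_bundle()
for e in range(120):
    fn.process(e)
fn.finish_bundle()
# 120 elements leave the bundle as 3 RPC calls instead of 120.
assert len(calls) == 3 and sum(len(b) for b in calls) == 120
```

The external system sees invocations per batch rather than per element, which is often the difference between a service keeping up and falling over.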
The speed of light is, well, still the speed of light. :-) Avoid using sinks and sources that are far away from your workers.
Upvotes: 11