MichelDelpech

Reputation: 863

Apache Beam - What are the key concepts for writing efficient data processing pipelines I should be aware of?

I've been using Beam for some time now and I'd like to know what are the key concepts for writing efficient and optimized Beam pipelines.

I have a bit of a Spark background, and I know that we may prefer reduceByKey over groupByKey to avoid shuffling and optimise network traffic.

Is it the same case for Beam?

I'd appreciate some tips or materials/best practices.

Upvotes: 4

Views: 1925

Answers (1)

Reza Rokni

Reputation: 1256

Some items to consider:

Graph Design Considerations:

  • Filter first; place filter operations as high in the DAG as possible.

  • Combine early; if there is a choice as to when to combine, do it as early as possible.

  • If possible, reduce the cost of large sliding windows by pre-aggregating into smaller fixed windows before the large sliding window: FixedWindow.of(1m) | Combine | SlidingWindow.of(6 hours)

  • Most runners support graph fusion, which is the right thing 99% of the time. But after a transform with massive fan-out, you should break fusion (for example with a Reshuffle) so the downstream work can be redistributed.
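The fixed-before-sliding idea can be sketched in plain Python (no Beam dependency; the window sizes, the per-minute sums, and the function names are illustrative assumptions, not Beam APIs):

```python
from collections import defaultdict

def pre_combine_fixed(events, fixed_secs=60):
    """Sum (timestamp, value) events into fixed windows keyed by window start
    (mimics FixedWindow | Combine)."""
    partials = defaultdict(int)
    for ts, value in events:
        partials[ts - ts % fixed_secs] += value
    return sorted(partials.items())

def sliding_sum(points, window_secs, period_secs):
    """Combine points into sliding windows of window_secs, advancing every period_secs."""
    if not points:
        return []
    last_ts = max(ts for ts, _ in points)
    out = []
    start = 0
    while start <= last_ts:
        out.append((start, sum(v for ts, v in points if start <= ts < start + window_secs)))
        start += period_secs
    return out

# 10,000 raw events spread over one hour: without pre-combining, each 6-hour
# sliding pane would have to touch every raw element.
raw = [(i % 3600, 1) for i in range(10_000)]
partials = pre_combine_fixed(raw)        # at most 60 partial sums, one per minute
panes = sliding_sum(partials, 6 * 3600, 3600)
print(len(raw), len(partials), panes[0])
```

The sliding windows now combine at most 60 pre-aggregated partials instead of 10,000 raw elements, which is the whole point of the FixedWindow | Combine | SlidingWindow shape.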

Coders

  • Choose coders that provide good performance; for example, in Java use something like a Proto or Avro coder rather than the default Java serialization.
  • Advanced tip: encoding/decoding is a large source of overhead, so if you have a large blob but only need part of it structured, you can selectively decode just that part.
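Selective decoding can be sketched in plain Python: the record layout below (a small fixed header followed by a JSON blob) is an illustrative assumption, but it shows how reading just the header avoids the expensive parse of the full payload.

```python
import json
import struct

# Illustrative record layout: (event_time_millis, payload_length) header, then JSON body.
HEADER = struct.Struct(">qi")

def encode(event_time_millis, payload):
    """Full record: fixed-size header followed by a JSON blob."""
    body = json.dumps(payload).encode("utf-8")
    return HEADER.pack(event_time_millis, len(body)) + body

def decode_header(blob):
    """Selective decode: read only the fixed-size header, skipping the JSON parse."""
    return HEADER.unpack_from(blob, 0)

blob = encode(1_700_000_000_000, {"user": "a", "clicks": list(range(1000))})
print(decode_header(blob))  # header fields only, without touching the JSON body
```

If a downstream step only needs the timestamp (say, for windowing), it never pays the cost of deserializing the large structured part.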

Logging

  • Avoid Log.info at the per-element level; it is rarely of value and is a common root cause of reported performance issues.

Data Skew

  • Understand the data set and the implications of hot keys. Nullable fields used as keys are often culprits (many elements collapse onto the null key)... Make use of parallelism hints such as Combine.PerKey.withHotKeyFanout if needed
  • For keys in general

    • Too few keys: bad - the workload is hard to shard, and per-key ordering will affect performance
    • Too many keys: can be bad too - per-key overhead starts to creep in.
  • Advanced key tip:

    • Sometimes you can combine a key with the element's window, {key, window}, to help distribute work more evenly
    • Not a requirement, but if you have the ability and want to go to this level of optimisation: aim for roughly O(10K) to O(100K) keys. If the keyspace is much larger, consider using hashes to separate keys out internally. This is especially useful when keys carry date/time information; in that case you can "re-use" processing keys from the past that are no longer active, essentially for free.
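The two key tips above can be combined in one plain-Python sketch: hash the logical key together with its window onto a bounded processing keyspace (the shard count and function name are illustrative; real jobs would size the keyspace per the ~O(10K)-O(100K) guidance):

```python
import hashlib

NUM_SHARDS = 128  # illustrative; scale toward O(10K)-O(100K) for a real job

def shard_key(key, window_start, num_shards=NUM_SHARDS):
    """Map a {key, window} pair onto a bounded, reusable processing keyspace."""
    digest = hashlib.sha1(f"{key}|{window_start}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# A hot key spread across 60 one-minute windows lands on several distinct
# processing keys, distributing the work instead of piling onto one worker.
shards = {shard_key("hot-user", w) for w in range(0, 3600, 60)}
print(len(shards))
```

Because the shard is derived from the window as well as the key, processing keys from expired windows naturally fall out of use and get "re-used" by later windows for free.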

Sources, Sinks and External Systems

  • Compressed files can be read easily with an option flag; however, because a compressed file cannot be read from an arbitrary offset, TextIO cannot distribute the read across workers. If you have very large files to read, uncompressing them before starting the pipeline can provide a nice performance boost. Also look at splittable formats like compressed Avro.

  • Backpressure: Beam runners are designed to rapidly chew through parallel work. They can spin up many threads across many machines to achieve this goal. This can easily swamp external systems, especially if you are making a per-element RPC call. If the external system cannot be made to scale, create batches using startBundle / finishBundle to help alleviate the invocations per second.

  • The speed of light is, well, still the speed of light.. :-) Avoid using sinks and sources that are far away from your workers.
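The startBundle / finishBundle batching pattern can be sketched in plain Python (no Beam dependency; the class, `rpc` callback, and `batch_size` are illustrative stand-ins for a real DoFn and external call):

```python
class BatchingDoFn:
    """Plain-Python sketch of batching per-element RPCs at bundle boundaries.

    `rpc` stands in for an expensive external call; in a real pipeline the
    buffer would live in a DoFn, filled in processElement and flushed in
    finishBundle, with batch_size tuned to what the external system can absorb.
    """

    def __init__(self, rpc, batch_size=100):
        self.rpc = rpc
        self.batch_size = batch_size
        self.buffer = []

    def start_bundle(self):
        self.buffer = []

    def process(self, element):
        self.buffer.append(element)
        if len(self.buffer) >= self.batch_size:
            self.rpc(self.buffer)   # one call covers a whole batch
            self.buffer = []

    def finish_bundle(self):
        if self.buffer:             # flush whatever is left in the bundle
            self.rpc(self.buffer)
            self.buffer = []

calls = []
fn = BatchingDoFn(calls.append, batch_size=100)
fn.start_bundle()
for i in range(250):
    fn.process(i)
fn.finish_bundle()
print(len(calls))  # 3 RPCs instead of 250 per-element calls
```

The external system sees 3 invocations instead of 250, which is exactly the pressure relief the bullet above describes.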

Metrics

  • Make use of Beam metrics to instrument your pipelines.

Upvotes: 11
