Josh Sacks

Reputation: 113

Controlling Dataflow/Apache Beam output sharding

We've found experimentally that setting an explicit number of output shards in Dataflow/Apache Beam pipelines results in much worse performance. Our evidence suggests that Dataflow quietly performs an additional GroupByKey at the end to enforce the shard count. We've moved to letting Dataflow select the number of shards automatically (num_shards=0). However, for some pipelines this results in a huge number of relatively small output files (~15K files, each <1MB).

Is there any way to send hints to Dataflow about the expected size of the outputs so it can scale accordingly? We notice that this problem happens mostly when the input dataset is quite large and the output is much smaller.

We're using Apache Beam Python 2.2.
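For scale, this is the kind of sizing arithmetic we'd like the service to do for us. A minimal sketch, assuming we could estimate the output size up front (the helper name and the 256 MB target file size are hypothetical, not a Beam API; the resulting count is what one would pass as num_shards to a sink like beam.io.WriteToText):

```python
import math

# Hypothetical helper: derive an explicit shard count from an estimated
# total output size, aiming for roughly one target-sized file per shard.
def shards_for(estimated_output_bytes, target_shard_bytes=256 * 1024 * 1024):
    """Return a shard count targeting ~target_shard_bytes per output file."""
    return max(1, math.ceil(estimated_output_bytes / target_shard_bytes))

# ~10 GB of output at ~256 MB per file -> 40 shards instead of ~15K tiny files.
print(shards_for(10 * 1024**3))
```

Of course, doing this ourselves reintroduces the explicit-shard-count performance penalty described above, which is why a service-side hint would be preferable.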

Upvotes: 8

Views: 3536

Answers (1)

Scott Wegner

Reputation: 7493

This type of hinting is not supported in Dataflow / Apache Beam. In general, Dataflow and Apache Beam are designed to be as "no knobs" as possible, for a couple of reasons:

  1. To allow the Dataflow service to intelligently make optimization decisions on its own. Dataflow has smart autoscaling capabilities which can scale the number of worker VMs up or down according to the current workload.
  2. To ensure that pipelines written with the Apache Beam SDK are portable across runners (such as Dataflow, Spark or Flink). Pipeline logic is written in terms of a set of abstractions such that the job can be run in a variety of environments. Each runner may apply its own set of optimizations to these high-level abstractions.

Upvotes: 2

Related Questions