Reputation: 113
We've found experimentally that setting an explicit number of output shards in Dataflow/Apache Beam pipelines results in much worse performance. Our evidence suggests that Dataflow implicitly does another GroupBy at the end to enforce the shard count. We've moved to letting Dataflow select the number of shards automatically (num_shards=0). However, for some pipelines this produces a huge number of relatively small output files (~15K files, each under 1MB).
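For reference, here is a minimal sketch of the two write configurations we're comparing (the bucket paths, shard count, and transform names are placeholders, not our real job):

```python
import apache_beam as beam

p = beam.Pipeline()

records = (
    p
    | 'Read' >> beam.io.ReadFromText('gs://my-bucket/input/*')  # large input dataset
    | 'Process' >> beam.Map(lambda line: line)                  # stand-in for our real transforms
)

# Variant 1: explicit shard count -- this is the setup that ran much slower for us.
records | 'WriteFixedShards' >> beam.io.WriteToText(
    'gs://my-bucket/output-fixed/part', num_shards=200)

# Variant 2: num_shards=0 lets Dataflow pick -- fast, but yields ~15K tiny (<1MB) files.
records | 'WriteAutoShards' >> beam.io.WriteToText(
    'gs://my-bucket/output-auto/part', num_shards=0)

p.run()
```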
Is there any way to send hints to Dataflow about the expected size of the outputs so it can scale accordingly? We've noticed that this problem happens mostly when the input dataset is quite large and the output is much smaller.
We're using Apache Beam Python 2.2.
Upvotes: 8
Views: 3536
Reputation: 7493
This type of hinting is not supported in Dataflow / Apache Beam. In general, Dataflow and Apache Beam are designed to be as "no knobs" as possible, for a couple of reasons:
Upvotes: 2