Haden Hooyeon Lee

Reputation: 305

Too many 'steps' when executing a pipeline

We have a large data set which needs to be partitioned into 1,000 separate files. The simplest implementation we wanted to use is to apply a PartitionFn which, given an element of the data set, returns a random integer between 1 and 1,000. The problem with this approach is that it ends up creating 1,000 PCollections, and the pipeline does not launch because there seems to be a hard limit on the number of 'steps' (which correspond to the boxes shown in the execution graph on the job monitoring UI).
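For concreteness, a minimal sketch of this approach using the Apache Beam Java SDK (the class name, bucket paths, and input format are placeholders; the Partition transform has the same shape in the older Dataflow SDK):

```java
import java.util.concurrent.ThreadLocalRandom;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Partition;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;

public class RandomPartitionExample {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Hypothetical input location; substitute the real data set.
    PCollection<String> records = p.apply(TextIO.read().from("gs://my-bucket/input/*"));

    // Assign every element to one of 1,000 partitions at random.
    PCollectionList<String> parts =
        records.apply(
            Partition.of(
                1000,
                new Partition.PartitionFn<String>() {
                  @Override
                  public int partitionFor(String element, int numPartitions) {
                    return ThreadLocalRandom.current().nextInt(numPartitions);
                  }
                }));

    // Each of these 1,000 writes becomes its own step (or steps) in the job
    // graph, which is what runs into the step limit described above.
    for (int i = 0; i < parts.size(); i++) {
      parts.get(i).apply("Write" + i, TextIO.write().to("gs://my-bucket/output/part-" + i));
    }

    p.run();
  }
}
```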

Is there a way to increase this limit (and what is the limit)?

The solution we are using to get around this issue is to partition the data into smaller subsets first (say 50 subsets), and for each subset run another layer of partitioning pipelines to produce 20 subsets each (so the end result is 1,000 subsets). It would be nice if we could avoid this extra layer, as it ends up creating 1 + 50 pipelines and incurs the extra cost of writing and reading the intermediate data.

Upvotes: 0

Views: 116

Answers (1)

Ben Chambers

Reputation: 6130

Rather than using the Partition transform and introducing many steps in the pipeline, consider one of the following approaches:

  1. Many sinks support the option to specify the number of output shards. For example, TextIO has a withNumShards method. If you pass it 1000, it will produce 1,000 separate shards in the specified directory (see the first sketch after this list).
  2. Use the shard number as a key, then use a GroupByKey plus a DoFn to write the results (see the second sketch after this list).
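For the first option, a minimal sketch (Apache Beam Java SDK; the class name and paths are placeholders). A single write transform fans the data out into 1,000 shard files, so the job graph stays small:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

public class ShardedWriteExample {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Hypothetical input location; substitute the real data set.
    PCollection<String> records = p.apply(TextIO.read().from("gs://my-bucket/input/*"));

    // One write step that produces 1,000 shard files
    // (e.g. part-00000-of-01000, part-00001-of-01000, ...).
    records.apply(
        TextIO.write()
            .to("gs://my-bucket/output/part")
            .withNumShards(1000));

    p.run();
  }
}
```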
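For the second option, a sketch under the same assumptions (the WriteShardFn class, the shard count of 1,000, and the output prefix are illustrative): each element is keyed with a random shard number, a single GroupByKey gathers the elements for each shard, and one DoFn writes each group to its own file.

```java
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.channels.Channels;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ThreadLocalRandom;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.fs.ResourceId;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;

public class GroupedWriteExample {

  /** Writes every element of a group to one file named after its shard key. */
  static class WriteShardFn extends DoFn<KV<Integer, Iterable<String>>, Void> {
    private final String outputPrefix;

    WriteShardFn(String outputPrefix) {
      this.outputPrefix = outputPrefix;
    }

    @ProcessElement
    public void processElement(ProcessContext c) throws IOException {
      ResourceId file =
          FileSystems.matchNewResource(outputPrefix + "-" + c.element().getKey() + ".txt", false);
      try (Writer writer =
          new OutputStreamWriter(
              Channels.newOutputStream(FileSystems.create(file, "text/plain")),
              StandardCharsets.UTF_8)) {
        for (String line : c.element().getValue()) {
          writer.write(line);
          writer.write('\n');
        }
      }
    }
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply(TextIO.read().from("gs://my-bucket/input/*"))   // hypothetical input
        // Key each element with a random shard number in [0, 1000).
        .apply(MapElements.into(
                TypeDescriptors.kvs(TypeDescriptors.integers(), TypeDescriptors.strings()))
            .via((String line) -> KV.of(ThreadLocalRandom.current().nextInt(1000), line)))
        // A single GroupByKey collects the elements for each shard...
        .apply(GroupByKey.create())
        // ...and a single DoFn writes each shard to its own file.
        .apply(ParDo.of(new WriteShardFn("gs://my-bucket/output/shard")));

    p.run();
  }
}
```

Note that with this approach all elements of a given shard are delivered to one DoFn call when the group is written, so each shard needs to be small enough to process on a single worker.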

Upvotes: 1
