Haden Hooyeon Lee

Reputation: 305

Too many 'steps' when executing a pipeline

We have a large data set which needs to be partitioned into 1,000 separate files. The simplest implementation we wanted to use is to apply a PartitionFn which, given an element of the data set, returns a random integer between 1 and 1,000. The problem with this approach is that it ends up creating 1,000 PCollections, and the pipeline does not launch because there seems to be a hard limit on the number of 'steps' (which correspond to the boxes shown in the execution graph on the job monitoring UI).
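For concreteness, a minimal sketch of this approach using the Apache Beam Java SDK (the class name, bucket paths, and input format are placeholders; the Partition transform has the same shape in the older Dataflow SDK):

```java
import java.util.concurrent.ThreadLocalRandom;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Partition;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;

public class RandomPartitionExample {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Hypothetical input location; substitute the real data set.
    PCollection<String> records = p.apply(TextIO.read().from("gs://my-bucket/input/*"));

    // Assign every element to one of 1,000 partitions at random.
    PCollectionList<String> parts =
        records.apply(
            Partition.of(
                1000,
                new Partition.PartitionFn<String>() {
                  @Override
                  public int partitionFor(String element, int numPartitions) {
                    return ThreadLocalRandom.current().nextInt(numPartitions);
                  }
                }));

    // Each of these 1,000 writes becomes its own step (or steps) in the job
    // graph, which is what runs into the step limit described above.
    for (int i = 0; i < parts.size(); i++) {
      parts.get(i).apply("Write" + i, TextIO.write().to("gs://my-bucket/output/part-" + i));
    }

    p.run();
  }
}
```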

Is there a way to increase this limit (and what is the limit)?

The solution we are using to get around this issue is to partition the data into smaller subsets first (say 50 subsets), and for each subset run another layer of partitioning pipelines to produce 20 subsets each (so the end result is 1,000 subsets). It would be nice if we could avoid this extra layer, as it ends up creating 1 + 50 pipelines and incurs the extra cost of writing and reading the intermediate data.

Upvotes: 0

Views: 116

Answers (1)

Ben Chambers

Reputation: 6130

Rather than using the Partition transform and introducing many steps in the pipeline, consider one of the following approaches:

  1. Many sinks support the option to specify the number of output shards. For example, TextIO has a withNumShards method. If you pass it 1000, it will produce 1,000 separate shards in the specified directory (see the first sketch after this list).
  2. Use the shard number as a key, then use a GroupByKey plus a DoFn to write the results (see the second sketch after this list).
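For the first option, a minimal sketch (Apache Beam Java SDK; the class name and paths are placeholders). A single write transform fans the data out into 1,000 shard files, so the job graph stays small:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

public class ShardedWriteExample {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Hypothetical input location; substitute the real data set.
    PCollection<String> records = p.apply(TextIO.read().from("gs://my-bucket/input/*"));

    // One write step that produces 1,000 shard files
    // (e.g. part-00000-of-01000, part-00001-of-01000, ...).
    records.apply(
        TextIO.write()
            .to("gs://my-bucket/output/part")
            .withNumShards(1000));

    p.run();
  }
}
```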
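For the second option, a sketch under the same assumptions (the WriteShardFn class, the shard count of 1,000, and the output prefix are illustrative): each element is keyed with a random shard number, a single GroupByKey gathers the elements for each shard, and one DoFn writes each group to its own file.

```java
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.channels.Channels;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ThreadLocalRandom;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.fs.ResourceId;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;

public class GroupedWriteExample {

  /** Writes every element of a group to one file named after its shard key. */
  static class WriteShardFn extends DoFn<KV<Integer, Iterable<String>>, Void> {
    private final String outputPrefix;

    WriteShardFn(String outputPrefix) {
      this.outputPrefix = outputPrefix;
    }

    @ProcessElement
    public void processElement(ProcessContext c) throws IOException {
      ResourceId file =
          FileSystems.matchNewResource(outputPrefix + "-" + c.element().getKey() + ".txt", false);
      try (Writer writer =
          new OutputStreamWriter(
              Channels.newOutputStream(FileSystems.create(file, "text/plain")),
              StandardCharsets.UTF_8)) {
        for (String line : c.element().getValue()) {
          writer.write(line);
          writer.write('\n');
        }
      }
    }
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply(TextIO.read().from("gs://my-bucket/input/*"))   // hypothetical input
        // Key each element with a random shard number in [0, 1000).
        .apply(MapElements.into(
                TypeDescriptors.kvs(TypeDescriptors.integers(), TypeDescriptors.strings()))
            .via((String line) -> KV.of(ThreadLocalRandom.current().nextInt(1000), line)))
        // A single GroupByKey collects the elements for each shard...
        .apply(GroupByKey.create())
        // ...and a single DoFn writes each shard to its own file.
        .apply(ParDo.of(new WriteShardFn("gs://my-bucket/output/shard")));

    p.run();
  }
}
```

Note that with this approach all elements of a given shard are delivered to one DoFn call when the group is written, so each shard needs to be small enough to process on a single worker.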

Upvotes: 1
