Reputation: 305
We have a large data set which needs to be partitioned into 1,000 separate files. The simplest implementation we wanted to use is to apply a PartitionFn which, given an element of the data set, returns a random integer between 1 and 1,000. The problem with this approach is that it ends up creating 1,000 PCollections, and the pipeline does not launch because there seems to be a hard limit on the number of 'steps' (which correspond to the boxes shown on the job monitoring UI in the execution graph).
Is there a way to increase this limit (and what is the limit)?
The workaround we are using is to partition the data into smaller subsets first (say 50 subsets), and then for each subset run another layer of partitioning pipelines that produces 20 subsets per subset (so the end result is 1,000 subsets). It would be nice if we could avoid this extra layer, as it ends up creating 1 + 50 pipelines and incurs the extra cost of writing and reading the intermediate data.
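For reference, a minimal sketch of the approach described above, assuming the Apache Beam Java SDK; the bucket count, paths, and class name are illustrative:

    import java.util.concurrent.ThreadLocalRandom;

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Partition;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.PCollectionList;

    public class RandomPartitionSketch {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        PCollection<String> records = p.apply(TextIO.read().from("gs://my-bucket/input/*"));

        // Assign each record to one of 1,000 buckets at random.
        PCollectionList<String> buckets = records.apply(
            Partition.of(1000, new Partition.PartitionFn<String>() {
              @Override
              public int partitionFor(String elem, int numPartitions) {
                return ThreadLocalRandom.current().nextInt(numPartitions);
              }
            }));

        // Writing each bucket separately adds ~1,000 write transforms to the
        // execution graph, which is what runs into the step limit.
        for (int i = 0; i < buckets.size(); i++) {
          buckets.get(i).apply("Write" + i,
              TextIO.write().to("gs://my-bucket/output/part-" + i + "/data"));
        }

        p.run();
      }
    }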
Upvotes: 0
Views: 116
Reputation: 6130
Rather than using the Partition transform and introducing many steps in the pipeline, consider either of the following approaches:

1. TextIO has a withNumShards method. If you pass it 1000, it will produce 1,000 separate shards in the specified directory.
2. GroupByKey + a DoFn to write the results.

Both options are sketched below.
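A minimal sketch of option 1, reusing the records PCollection from the question's sketch (the output prefix is illustrative). A single write transform produces all 1,000 shard files, so the graph stays small:

    // Option 1: one TextIO write step that emits 1,000 shard files.
    records.apply(
        TextIO.write()
            .to("gs://my-bucket/output/part")
            .withNumShards(1000));

And a rough sketch of option 2: key each element with a random bucket id, group, and have a DoFn write each group to its own file. The FileSystems-based write, paths, and bucket count are illustrative only (no retry or idempotence handling):

    // Option 2: GroupByKey on a random bucket id, then write each group from a DoFn.
    // Additional imports needed (among others): java.io.Writer, java.nio.channels.Channels,
    // java.nio.charset.StandardCharsets, org.apache.beam.sdk.io.FileSystems,
    // org.apache.beam.sdk.transforms.*, org.apache.beam.sdk.values.*
    records
        .apply(WithKeys.of((String line) -> ThreadLocalRandom.current().nextInt(1000))
            .withKeyType(TypeDescriptors.integers()))
        .apply(GroupByKey.create())
        .apply(ParDo.of(new DoFn<KV<Integer, Iterable<String>>, Void>() {
          @ProcessElement
          public void processElement(ProcessContext c) throws Exception {
            KV<Integer, Iterable<String>> bucket = c.element();
            // One output file per bucket, named after the bucket id.
            String path = "gs://my-bucket/output/bucket-" + bucket.getKey() + ".txt";
            try (Writer writer = Channels.newWriter(
                FileSystems.create(
                    FileSystems.matchNewResource(path, false /* isDirectory */),
                    "text/plain"),
                StandardCharsets.UTF_8.name())) {
              for (String line : bucket.getValue()) {
                writer.write(line);
                writer.write("\n");
              }
            }
          }
        }));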
Upvotes: 1