Reputation: 507
I'm using the Python apache_beam SDK for Dataflow. I have about 300 files, each containing an array of 4 million entries; the whole dataset is about 5 GB, stored in a GCS bucket.
I can easily produce a PCollection of the arrays {x_1, ..., x_n} by reading each file, but the operation I now need is like Python's zip function: I want a PCollection indexed from 0 to n-1, where element i contains the array of all the x_i across the files. I tried yielding (i, element) for each element and then running GroupByKey, but that is far too slow and inefficient: it won't run at all locally because of memory constraints, and it took 24 hours on the cloud, even though I'm sure I could at least load the whole dataset into memory if I wanted to.
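A minimal sketch of what I tried, assuming a hypothetical bucket path and a placeholder read_array helper in place of my real file-reading code:

    import apache_beam as beam

    def read_array(path):
        # Placeholder: load the 4-million-entry array stored at `path`.
        # The real pipeline reads and decodes the file contents here.
        return []

    def index_elements(array):
        # Emit (index, value) pairs so values at the same position
        # across different files share a key.
        for i, x in enumerate(array):
            yield (i, x)

    with beam.Pipeline() as p:
        zipped = (
            p
            | 'ListFiles' >> beam.Create(['gs://my-bucket/file-000',   # hypothetical paths
                                          'gs://my-bucket/file-001'])
            | 'ReadArrays' >> beam.Map(read_array)
            | 'IndexElements' >> beam.FlatMap(index_elements)
            | 'GroupByIndex' >> beam.GroupByKey()  # (i, [x_i from every file])
        )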
How do I restructure the pipeline to do this cleanly?
Upvotes: 1
Views: 360
Reputation: 507
As jkff pointed out in the comment above, the code is indeed correct, and this procedure is the recommended way to program a TensorFlow algorithm. The bottleneck was the DoFn applied to each element, not the GroupByKey itself.
Upvotes: 0