Enucatl

Reputation: 507

What's the equivalent of the Python zip function in Dataflow?

I'm using the Python apache_beam version of Dataflow. I have about 300 files, each containing an array of 4 million entries. The whole dataset is about 5 GB, stored in a GCS bucket.

I can easily produce a PCollection of the arrays {x_1, ..., x_n} by reading each file. The operation I now need is like the Python zip function: I want a PCollection indexed from 0 to n-1, where each element i contains the array of all the x_i across the files. I tried yielding (i, element) for each element and then running GroupByKey, but that is much too slow and inefficient: it won't run at all locally because of memory constraints, and it took 24 hours on the cloud, even though I'm sure I could at least load the whole dataset into memory if I wanted to.

How do I restructure the pipeline to do this cleanly?

Upvotes: 1

Views: 360

Answers (1)

Enucatl

Reputation: 507

As jkff pointed out in the comments above, the code is indeed correct, and the procedure is the recommended way of preparing data for a TensorFlow algorithm. The DoFn applied to each element was the bottleneck.

Upvotes: 0
