Reputation: 4672
In a Google Dataflow pipeline written with Apache Beam, one can write data to a text file. The Apache Beam website notes the possibility of writing "multiple output files" here, but the code shown is the same as for aggregate files (and that's all I get).
Is it possible to generate a file for each item in a PCollection?
Upvotes: 0
Views: 1151
Reputation: 17913
You can do this yourself by mapping the PCollection with a ParDo that takes an element and writes it to a file using the FileSystems API. The Java version of the API is here, the Python version is here; in the Java version, you'll need to use FileSystems.create() to get a channel for writing (FileSystems.open() returns a channel for reading).
Note that your pipeline will likely be vulnerable to worker failures: if a worker fails and the work is retried, you may end up with leftover garbage files from the failed attempts.
For a more general solution, you'll need to wait for http://s.apache.org/fileio-write, which is currently being implemented and will be released in Beam Java 2.2.
Upvotes: 1