Reputation: 16955
I have a PCollection<String> in Google Cloud Dataflow and I'm outputting it to text files via TextIO.Write.to:
PCollection<String> lines = ...;
lines.apply(TextIO.Write.to("gs://bucket/output.txt"));
Currently the lines of each shard of output are in random order.
Is it possible to get Dataflow to output the lines in sorted order?
Upvotes: 1
Views: 730
Reputation: 6023
This is not directly supported by Dataflow.
For a bounded PCollection, if you shard your input finely enough, you can write sorted files with a Sink implementation that sorts each shard. You may want to refer to the TextSink implementation for a basic outline.
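To illustrate the "sort each shard" part, here is a minimal sketch in plain Java. It is not the Dataflow Sink API: SortedShardWriter is a hypothetical helper name, and it only shows the buffer-then-sort logic that a custom per-shard writer would need. It assumes a single shard's lines fit in memory, which is presumably why the shards have to be fine enough.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/**
 * Hypothetical helper (not part of the Dataflow SDK): buffers the lines of
 * one shard in memory and writes them out in sorted order on close.
 */
final class SortedShardWriter implements AutoCloseable {
    private final Writer out;
    private final List<String> buffer = new ArrayList<>();

    SortedShardWriter(Writer out) {
        this.out = out;
    }

    /** Buffers a line; nothing reaches the output until close(). */
    void write(String line) {
        buffer.add(line);
    }

    /** Sorts the buffered lines, writes one per line, then closes the output. */
    @Override
    public void close() throws IOException {
        Collections.sort(buffer); // natural String ordering
        for (String line : buffer) {
            out.write(line);
            out.write('\n');
        }
        out.close();
    }

    public static void main(String[] args) throws IOException {
        StringWriter shard = new StringWriter();
        try (SortedShardWriter writer = new SortedShardWriter(shard)) {
            writer.write("pear");
            writer.write("apple");
            writer.write("banana");
        }
        System.out.print(shard);
    }
}
```

In a real pipeline this logic would live inside the per-shard writer of your custom Sink. Note that it sorts only within a shard, so the output files are not ordered relative to one another.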
Upvotes: 3