Reputation: 2633
I haven't found any documentation or existing questions/answers for my use case, so I thought I would post a question.
On Apache Beam/Google Cloud Dataflow, I need to receive a PubSub message, generate a dynamic BigQuery query based on information in that message, pull rows (each containing a batchID) from BigQuery, create one file per batchID on Google Cloud Storage, and then stream the rows into those batchID files. For each BigQuery row (represented as a JSON string), I would check its batchID and append it to the correct batchID file as a new line.
I have figured out the PubSub and BigQuery stuff. I'm now at the stage where I have a PCollection of Strings (each string is a row from BigQuery, and the strings can be grouped by their batchID).
I would like to create a file for each batchID and stream the strings into these files as they come in. I would really like to avoid aggregating all of a batchID's strings in memory (it could be GBs of data) and only then writing the file.
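A rough sketch of the kind of pipeline I'm describing (not my actual code: the table, the runId field, the subscription name, and the query are placeholders, and pipeline options such as streaming mode are omitted):

```python
import json

import apache_beam as beam
from apache_beam.io.gcp.pubsub import ReadFromPubSub
from google.cloud import bigquery


class RowsForMessage(beam.DoFn):
    """Build a query from the incoming message and emit each result row as a JSON string."""

    def setup(self):
        # One BigQuery client per worker, reused across bundles.
        self.client = bigquery.Client()

    def process(self, message_bytes):
        params = json.loads(message_bytes)  # assume the message body is JSON
        query = (
            "SELECT * FROM `my_dataset.my_table` "
            f"WHERE runId = '{params['runId']}'"
        )
        # Results are paged by the client, so we can emit row by row.
        for row in self.client.query(query).result():
            yield json.dumps(dict(row), default=str)


with beam.Pipeline() as p:
    rows = (
        p
        | 'ReadMessages' >> ReadFromPubSub(
            subscription='projects/my-project/subscriptions/my-sub')
        | 'QueryBigQuery' >> beam.ParDo(RowsForMessage())
    )
    # 'rows' is the PCollection of JSON strings described above, one per BigQuery row.
```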
Upvotes: 0
Views: 455
Reputation: 5104
You could do a GroupByKey on the batchID, then iterate over the values, writing the file. The iterable produced by a GroupByKey need not fit into memory.
Note that if you're writing files you may need to write to a temporary location and then do a rename to make things idempotent.
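A minimal sketch of that approach, assuming each row is a JSON string with a batchID field; the bucket path and the WriteBatchFile DoFn are placeholders of my own, and the Create transform stands in for your PubSub/BigQuery stages:

```python
import json
import uuid

import apache_beam as beam
from apache_beam.io.filesystems import FileSystems


class WriteBatchFile(beam.DoFn):
    """Stream one batch's rows into a temp file, then rename it into place."""

    def __init__(self, output_prefix):
        self.output_prefix = output_prefix  # e.g. 'gs://my-bucket/batches' (placeholder)

    def process(self, keyed_rows):
        batch_id, rows = keyed_rows  # 'rows' is the lazy iterable produced by GroupByKey
        final_path = f'{self.output_prefix}/{batch_id}.jsonl'
        temp_path = f'{final_path}.{uuid.uuid4().hex}.tmp'
        out = FileSystems.create(temp_path)
        try:
            for row in rows:  # iterate without holding the whole batch in memory
                out.write((row + '\n').encode('utf-8'))
        finally:
            out.close()
        # Publish the file only once it has been fully written.
        FileSystems.rename([temp_path], [final_path])
        yield final_path


with beam.Pipeline() as p:
    rows = p | 'ExampleRows' >> beam.Create([  # stand-in for the PubSub/BigQuery stages
        '{"batchID": "a", "value": 1}',
        '{"batchID": "b", "value": 2}',
        '{"batchID": "a", "value": 3}',
    ])
    (rows
     | 'KeyByBatchId' >> beam.Map(lambda s: (json.loads(s)['batchID'], s))
     | 'GroupByBatchId' >> beam.GroupByKey()
     | 'WriteOneFilePerBatch' >> beam.ParDo(WriteBatchFile('gs://my-bucket/batches')))
```

With a bounded test input like this the default global window is fine; with the PubSub source you would need windowing or a trigger before the GroupByKey so it can emit results on an unbounded collection.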
Upvotes: 1