twils0

Reputation: 2633

Creating a file and streaming in data on Apache Beam/Google Cloud Dataflow

I haven't found any documentation or any existing questions/answers for my use case, so I thought I would post a question.

On Apache Beam/Google Cloud Dataflow, I need to receive a PubSub message, generate a dynamic BigQuery query based on information in that message, pull rows (each containing a batchID) from BigQuery, create one file per batchID on Google Cloud Storage, and then stream the rows into the batchID files. For each BigQuery row (represented as a JSON string), I would check its batchID and append it to the correct batchID file as a new line.

I've figured out the PubSub and BigQuery parts. I'm now at the point where I have a PCollection of Strings (each string is a row from BigQuery, and the strings are grouped by batchID).

I would like to:

  1. look at the batchID of each string as it comes in
  2. if a file does not exist for this batchID, create a new file; otherwise, do nothing
  3. add each string as a new line to the file corresponding to its batchID

In other words, I would like to create a file for each batchID and stream the strings to these files as they come in. I would really like to avoid aggregating all the strings for a batchID in memory (it could be GBs of data) and only then writing to a file.
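To make the setup concrete, here is a sketch of how I would expect the keying step to look (illustrative only; the 'batchID' field name and the JSON parsing are assumptions about my row format):

```python
import json

import apache_beam as beam


def key_by_batch_id(row):
    # Illustrative: assumes each row is a JSON string with a 'batchID' field.
    return (json.loads(row)['batchID'], row)


# rows is the existing PCollection of JSON strings pulled from BigQuery.
keyed_rows = rows | 'KeyByBatchID' >> beam.Map(key_by_batch_id)
```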

Upvotes: 0

Views: 455

Answers (1)

robertwb

Reputation: 5104

You could do a GroupByKey on the batchID, then iterate over the values, writing the file. The iterable produced by a GroupByKey does not need to fit into memory.
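A minimal sketch of that approach, assuming the PCollection has already been keyed into (batchID, row) pairs and that rows are UTF-8 JSON strings; the bucket path is a placeholder:

```python
import apache_beam as beam
from apache_beam.io.filesystems import FileSystems


class WriteBatchFile(beam.DoFn):
    def process(self, element):
        batch_id, rows = element
        # Placeholder output path; adjust the bucket and naming scheme.
        path = 'gs://your-bucket/batches/%s.jsonl' % batch_id
        f = FileSystems.create(path)
        try:
            # rows is the lazy iterable produced by GroupByKey; it is
            # consumed one element at a time, not materialized in memory.
            for row in rows:
                f.write(row.encode('utf-8'))
                f.write(b'\n')
        finally:
            f.close()


# keyed_rows is a PCollection of (batchID, row) pairs.
_ = (keyed_rows
     | 'GroupByBatchID' >> beam.GroupByKey()
     | 'WriteBatchFiles' >> beam.ParDo(WriteBatchFile()))
```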

Note that if you're writing files, you may need to write to a temporary location and then do a rename to make things idempotent.
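A sketch of that temp-then-rename pattern (paths are placeholders; FileSystems.rename takes lists of source and destination names):

```python
from apache_beam.io.filesystems import FileSystems


def write_batch_file_idempotently(batch_id, rows):
    final_path = 'gs://your-bucket/batches/%s.jsonl' % batch_id
    # A deterministic temp name means a retried bundle overwrites the
    # same temp file rather than leaving stray partial files behind.
    tmp_path = final_path + '.tmp'
    f = FileSystems.create(tmp_path)
    try:
        for row in rows:
            f.write(row.encode('utf-8'))
            f.write(b'\n')
    finally:
        f.close()
    # Publish the completed file under its final name.
    FileSystems.rename([tmp_path], [final_path])
```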

Upvotes: 1
