Graham Polley

Reputation: 14791

Dataflow streaming - does it fit our use case?

We've been using Dataflow in batch mode for a while now. However, we can't seem to find much info on its streaming mode.

We have the following use case:

Now, we could of course use Dataflow in batch mode, take chunks of the data from BigQuery (based on timestamps), and transform/clean/denormalize it that way.

But that's a bit of a messy approach, especially because the data is being streamed in real time, and working out which data still needs to be processed would probably get gnarly. It sounds brittle too.

It would be great if we could simply transform/clean/denormalize in Dataflow, and then write to BigQuery as it's streaming in.

Is this what Dataflow streaming is intended for? If so, what data source can Dataflow read from in streaming mode?

Upvotes: 2

Views: 233

Answers (1)

Tyler Akidau

Reputation: 206

Yes, that is a very reasonable use case for streaming mode. Currently we support reading from Cloud Pub/Sub via the PubsubIO source. Additional sources are in the works. Output can be written to BigQuery via the BigQueryIO sink. The PCollection docs cover the distinction between bounded and unbounded sources/sinks, as well as the currently available concrete implementations.
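To make that concrete, here's a minimal sketch of such a streaming pipeline using the Dataflow Java SDK: read from Pub/Sub via PubsubIO, clean each element in a ParDo, and stream the results into BigQuery via BigQueryIO. The topic name, table name, schema, and the trivial "cleaning" step are all placeholders, not part of the original answer.

```java
import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.BigQueryIO;
import com.google.cloud.dataflow.sdk.io.PubsubIO;
import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import java.util.Arrays;

public class StreamingPipeline {
  public static void main(String[] args) {
    DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
        .withValidation().as(DataflowPipelineOptions.class);
    options.setStreaming(true); // run the pipeline in streaming mode

    Pipeline p = Pipeline.create(options);

    // Hypothetical single-column output schema.
    TableSchema schema = new TableSchema().setFields(Arrays.asList(
        new TableFieldSchema().setName("payload").setType("STRING")));

    p.apply(PubsubIO.Read.topic("projects/my-project/topics/my-topic"))
     .apply(ParDo.of(new DoFn<String, TableRow>() {
       @Override
       public void processElement(ProcessContext c) {
         // Placeholder for your transform/clean/denormalize logic.
         c.output(new TableRow().set("payload", c.element().trim()));
       }
     }))
     .apply(BigQueryIO.Write.to("my-project:my_dataset.my_table")
         .withSchema(schema)
         .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

    p.run();
  }
}
```

The same pipeline shape runs in batch or streaming; flipping `setStreaming(true)` and swapping the source is what changes the execution mode.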

As to any apparent lack of streaming-specific documentation, the majority of the unified model is applicable in batch and streaming, so there is no streaming-specific section. That said, I'd recommend looking over the Windowing and Triggers sections of the PCollection docs, as those are particularly applicable when dealing with unbounded PCollections.
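As a rough illustration of those Windowing and Triggers concepts, the fragment below divides a hypothetical unbounded `PCollection<String>` called `messages` into one-minute fixed windows, with early firings and an allowed-lateness horizon; all the specific durations are made-up values for the sketch.

```java
import com.google.cloud.dataflow.sdk.transforms.windowing.AfterProcessingTime;
import com.google.cloud.dataflow.sdk.transforms.windowing.AfterWatermark;
import com.google.cloud.dataflow.sdk.transforms.windowing.FixedWindows;
import com.google.cloud.dataflow.sdk.transforms.windowing.Window;
import com.google.cloud.dataflow.sdk.values.PCollection;
import org.joda.time.Duration;

// "messages" is assumed to be an unbounded PCollection<String>,
// e.g. one read from Pub/Sub as in the pipeline above.
PCollection<String> windowed = messages.apply(
    Window.<String>into(FixedWindows.of(Duration.standardMinutes(1)))
        // Emit a pane when the watermark passes the end of the window,
        // plus speculative early results every 30s of processing time.
        .triggering(AfterWatermark.pastEndOfWindow()
            .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                .plusDelayOf(Duration.standardSeconds(30))))
        // Accept data arriving up to 10 minutes late.
        .withAllowedLateness(Duration.standardMinutes(10))
        .accumulatingFiredPanes());
```

Any grouping or aggregation applied downstream of this transform will then operate per window rather than over the whole (unbounded) collection.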

Upvotes: 3
