Graham Polley

Reputation: 14791

Dataflow streaming - does it fit our use case?

We've been using Dataflow in batch mode for a while now. However, we can't seem to find much info on its streaming mode.

We have the following use case:

Now, we could of course use Dataflow in batch mode, take chunks of the data from BigQuery (based on timestamps), and transform/clean/denormalize it that way.

But that's a bit of a messy approach, especially because the data is being streamed in real time, and working out which data still needs to be processed would probably get gnarly. It sounds brittle too.

It would be great if we could simply transform/clean/denormalize in Dataflow, and then write to BigQuery as it's streaming in.

Is this what Dataflow streaming is intended for? If so, what data source can Dataflow read from in streaming mode?

Upvotes: 2

Views: 233

Answers (1)

Tyler Akidau

Reputation: 206

Yes, that is a very reasonable use case for streaming mode. Currently we support reading from Cloud Pub/Sub via the PubsubIO source. Additional sources are in the works. Output can be written to BigQuery via the BigQueryIO sink. The PCollection docs cover the distinction between bounded and unbounded sources/sinks, as well as the currently available concrete implementations.
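To make that concrete, here's a minimal sketch of such a streaming pipeline using the Dataflow Java SDK: read from Pub/Sub via PubsubIO, clean each element in a ParDo, and stream the results into BigQuery via BigQueryIO. The topic name, table name, schema, and the trivial "cleaning" step are all placeholders, not part of the original answer.

```java
import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.BigQueryIO;
import com.google.cloud.dataflow.sdk.io.PubsubIO;
import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import java.util.Arrays;

public class StreamingPipeline {
  public static void main(String[] args) {
    DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
        .withValidation().as(DataflowPipelineOptions.class);
    options.setStreaming(true); // run the pipeline in streaming mode

    Pipeline p = Pipeline.create(options);

    // Hypothetical single-column output schema.
    TableSchema schema = new TableSchema().setFields(Arrays.asList(
        new TableFieldSchema().setName("payload").setType("STRING")));

    p.apply(PubsubIO.Read.topic("projects/my-project/topics/my-topic"))
     .apply(ParDo.of(new DoFn<String, TableRow>() {
       @Override
       public void processElement(ProcessContext c) {
         // Placeholder for your transform/clean/denormalize logic.
         c.output(new TableRow().set("payload", c.element().trim()));
       }
     }))
     .apply(BigQueryIO.Write.to("my-project:my_dataset.my_table")
         .withSchema(schema)
         .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

    p.run();
  }
}
```

The same pipeline shape runs in batch or streaming; flipping `setStreaming(true)` and swapping the source is what changes the execution mode.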

As to any apparent lack of streaming-specific documentation, the majority of the unified model is applicable in batch and streaming, so there is no streaming-specific section. That said, I'd recommend looking over the Windowing and Triggers sections of the PCollection docs, as those are particularly applicable when dealing with unbounded PCollections.
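As a rough illustration of those Windowing and Triggers concepts, the fragment below divides a hypothetical unbounded `PCollection<String>` called `messages` into one-minute fixed windows, with early firings and an allowed-lateness horizon; all the specific durations are made-up values for the sketch.

```java
import com.google.cloud.dataflow.sdk.transforms.windowing.AfterProcessingTime;
import com.google.cloud.dataflow.sdk.transforms.windowing.AfterWatermark;
import com.google.cloud.dataflow.sdk.transforms.windowing.FixedWindows;
import com.google.cloud.dataflow.sdk.transforms.windowing.Window;
import com.google.cloud.dataflow.sdk.values.PCollection;
import org.joda.time.Duration;

// "messages" is assumed to be an unbounded PCollection<String>,
// e.g. one read from Pub/Sub as in the pipeline above.
PCollection<String> windowed = messages.apply(
    Window.<String>into(FixedWindows.of(Duration.standardMinutes(1)))
        // Emit a pane when the watermark passes the end of the window,
        // plus speculative early results every 30s of processing time.
        .triggering(AfterWatermark.pastEndOfWindow()
            .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                .plusDelayOf(Duration.standardSeconds(30))))
        // Accept data arriving up to 10 minutes late.
        .withAllowedLateness(Duration.standardMinutes(10))
        .accumulatingFiredPanes());
```

Any grouping or aggregation applied downstream of this transform will then operate per window rather than over the whole (unbounded) collection.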

Upvotes: 3
