Google Dataflow - clarification around pricing for streaming pipeline with bounded data

Question

I am a bit confused about some of the Dataflow pricing around streaming:

I have a pipeline whose final step loads data into BigQuery using the FILE_LOADS method with a triggering_frequency set. That combination seems to require a streaming pipeline, and it is the only reason I need to run the pipeline as streaming. Everything else is purely batch, and the pipeline's data source is also bounded (another BigQuery table).

If I enable --streaming, how would that affect the pricing of this pipeline? The pricing page says the following are billed:

The volume of data ingested into your streaming pipeline
The complexity of the pipeline
The number of pipeline stages with shuffle operation or with stateful DoFns

My question is: will all of these also apply to the earlier steps/DoFns in my pipeline, even though they work on bounded data?

ningk · Accepted Answer

Yes, they will apply to the whole pipeline.

Your cost should stay roughly the same, since neither the volume of data nor the pipeline itself has changed. The triggering_frequency merely controls how often a load job is triggered.

Why do you need to set this frequency though? Does the default behavior not work for your batch job? I'm not sure how the pipeline will terminate in this setup. Will you have to cancel it once everything is processed?
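If the default behavior is enough, here is a minimal sketch of the batch alternative, assuming the Beam Python SDK: keep FILE_LOADS, drop triggering_frequency, and run without --streaming. The helper name file_loads_write_kwargs, the table names, and the assumption that the target table already exists are all hypothetical, not taken from your pipeline.

```python
from typing import Any, Dict, Optional


def file_loads_write_kwargs(
    table: str,
    streaming: bool,
    triggering_frequency_secs: Optional[int] = None,
) -> Dict[str, Any]:
    """Build keyword arguments for beam.io.WriteToBigQuery with FILE_LOADS.

    Hypothetical helper: it only sets triggering_frequency when the pipeline
    is meant to run in streaming mode, mirroring the constraint described in
    the question.
    """
    kwargs: Dict[str, Any] = {
        "table": table,
        "method": "FILE_LOADS",
        # Assumption: the target table already exists, so no schema is passed.
        "create_disposition": "CREATE_NEVER",
    }
    if streaming:
        if triggering_frequency_secs is None:
            raise ValueError("streaming FILE_LOADS needs a triggering frequency")
        kwargs["triggering_frequency"] = triggering_frequency_secs
    elif triggering_frequency_secs is not None:
        raise ValueError("triggering frequency only applies in streaming mode")
    return kwargs


def run(argv=None):
    # Imported here so the helper above works without Beam installed.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # No --streaming flag is expected in argv, so this runs as a batch job.
    options = PipelineOptions(argv)
    with beam.Pipeline(options=options) as p:
        (
            p
            # Placeholder table names; replace with real ones.
            | "Read" >> beam.io.ReadFromBigQuery(
                table="my-project:my_dataset.source_table")
            | "Write" >> beam.io.WriteToBigQuery(
                **file_loads_write_kwargs(
                    "my-project:my_dataset.target_table", streaming=False))
        )
```

Calling run() with your usual Dataflow options (project, region, temp_location, and so on) would submit this as a batch job, which should finish on its own once the bounded source is exhausted.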
