Reputation: 173
We're using Cloud Functions to transform our data in BigQuery:
- all data is in BigQuery
- to transform it, we only use SQL queries in BigQuery
- each query runs once a day
- our biggest SQL query runs for about 2 to 3 minutes, but most queries run in less than 30 seconds
- we have about 50 queries executed once a day, and this number is increasing

A minimal sketch of one such function follows.
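For concreteness, here's roughly what such a function can look like, assuming the Python runtime and the google-cloud-bigquery client; the project, dataset, table, and query are illustrative placeholders, not the actual ones from our setup:

```python
from google.cloud import bigquery

def run_transform(request):
    """HTTP-triggered Cloud Function that runs one daily SQL transformation."""
    client = bigquery.Client()
    # Hypothetical destination table; each of our ~50 queries has its own.
    job_config = bigquery.QueryJobConfig(
        destination="my_project.my_dataset.daily_summary",
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )
    job = client.query(
        """
        SELECT user_id, SUM(amount) AS total
        FROM `my_project.my_dataset.events`
        GROUP BY user_id
        """,
        job_config=job_config,
    )
    job.result()  # wait for the query to finish (usually under 30 seconds)
    return "done"
```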
We tried at first to do the same thing (SQL queries in BigQuery) with Dataflow, but:
- it took about 10 to 15 minutes just to start Dataflow
- it is more complicated to code than our Cloud Functions
- at the time, Dataflow SQL was not yet implemented
Every time we talk with someone using GCP (users, trainers or auditors), they recommend using Dataflow. So did we miss something "magic" about Dataflow for our use case? Is there a way to make it start in seconds rather than minutes?
Also, if we use streaming in Dataflow, how are costs calculated? I understand that in batch we pay for what we use, but what about streaming? Is it billed as a full-time running service?
Thanks for your help
Upvotes: 1
Views: 665
Reputation: 75735
For the first part, BigQuery vs Dataflow: I discussed this with Google a few weeks ago and their advice is clear: when your transformation can be expressed in SQL and your data is already in BigQuery, use BigQuery.
And yes, Dataflow takes about 3 minutes to start and another 3 minutes to stop. That's long, and you pay for that unproductive time.
For batch, as for streaming, you simply pay for the number (and size) of the Compute Engine instances used by your pipeline. Dataflow scales automatically within the boundaries you provide. Streaming pipelines don't scale to 0: even if there are no messages in your Pub/Sub topic, you still have at least 1 VM up, and you pay for it.
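You can at least cap that always-on cost by bounding autoscaling. As a rough sketch with the Beam Python SDK (the project, region, and topic names below are hypothetical):

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Bound autoscaling so the streaming bill stays predictable.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",      # hypothetical project
    region="europe-west1",     # hypothetical region
    streaming=True,
    num_workers=1,             # floor: a streaming pipeline never scales below this (never to 0)
    max_num_workers=5,         # ceiling for autoscaling
)

with beam.Pipeline(options=options) as p:
    (p
     | beam.io.ReadFromPubSub(topic="projects/my-project/topics/events")  # hypothetical topic
     | beam.Map(print))
```

With `num_workers=1` and `max_num_workers=5`, you pay for between 1 and 5 VMs at any given time, depending on throughput.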
Upvotes: 2