Jean-Charles VERAZZI

Reputation: 173

Use case for Dataflow (small SQL queries)

We're using Cloud Functions to transform our data in BigQuery:

  • all our data is in BigQuery
  • to transform the data, we only use SQL queries in BigQuery
  • each query runs once a day
  • our biggest SQL query runs for about 2 to 3 minutes, but most queries run for less than 30 seconds
  • we have about 50 queries executed once a day, and this number is increasing
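For context, here is a minimal sketch of what one of our Cloud Functions looks like, using the google-cloud-bigquery Python client (the function name and the project, dataset and table names are illustrative, not our real ones):

```python
from google.cloud import bigquery

def run_daily_transform(event, context):
    """Background Cloud Function, triggered once a day (for example by
    Cloud Scheduler through Pub/Sub). The transformation itself is plain
    SQL and runs entirely inside BigQuery."""
    client = bigquery.Client()

    # Illustrative query: our real transformations are plain SQL like this.
    sql = """
        CREATE OR REPLACE TABLE `my_project.my_dataset.daily_summary` AS
        SELECT user_id, COUNT(*) AS events
        FROM `my_project.my_dataset.raw_events`
        WHERE DATE(event_ts) = CURRENT_DATE()
        GROUP BY user_id
    """

    job = client.query(sql)  # starts the BigQuery job
    job.result()             # blocks until the query finishes
    print(f"Job {job.job_id} done, {job.total_bytes_processed} bytes processed")
```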

We tried at first to do the same thing (SQL queries in BigQuery) with Dataflow, but:

  • it took about 10 to 15 minutes just to start a Dataflow job
  • it is more complicated to code than our Cloud Functions
  • at that time, Dataflow SQL was not implemented

Every time we talk with someone using GCP (users, trainers or auditors), they recommend using Dataflow. So did we miss something "magic" about Dataflow in our use case? Is there a way to make it start in seconds rather than minutes?

Also, if we use streaming in Dataflow, how are costs calculated? I understand that in batch we pay for what we use, but what about streaming? Is it billed as a service running full-time?

Thanks for your help

Upvotes: 1

Views: 665

Answers (1)

guillaume blaquiere

Reputation: 75735

For the first part, BigQuery vs Dataflow, I discussed this with Google a few weeks ago and their advice is clear:

  • When you can express your transformation in SQL, and you can reach your data with BigQuery (external tables included), it's always quicker and cheaper with BigQuery, even if the query is complex.
  • For all the other use cases, Dataflow is the recommended option:
    • for realtime (a true need for realtime, with metrics computed on the fly using windowing); a minimal sketch of such a pipeline follows this list
    • when you need to reach an external API (ML, external services, ...)
    • when you need to sink into something other than BigQuery (Firestore, BigTable, Cloud SQL, ...) or read from a source not reachable by BigQuery
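To make the realtime case concrete, here is a minimal sketch of a streaming Beam pipeline in Python (the subscription name and the 60-second window are placeholders, not recommendations):

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import window

# Placeholder subscription; replace with your own.
SUBSCRIPTION = "projects/my-project/subscriptions/my-subscription"

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (p
     | "Read" >> beam.io.ReadFromPubSub(subscription=SUBSCRIPTION)
     | "Window" >> beam.WindowInto(window.FixedWindows(60))  # 1-minute windows
     | "Count" >> beam.CombineGlobally(
           beam.combiners.CountCombineFn()).without_defaults()
     | "Print" >> beam.Map(print))
```

This kind of windowed, on-the-fly metric is exactly what plain BigQuery SQL can't do; for a once-a-day SQL transformation it's overkill.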

And yes, a Dataflow job takes about 3 minutes to start and another 3 minutes to stop. It's long... and you pay for this unproductive time.

For both batch and streaming, you simply pay for the number (and the size) of the Compute Engine instances used by your pipeline. Dataflow scales automatically within the boundaries that you provide (a sketch of these options follows). Streaming pipelines don't scale to 0: even if you have no messages in your Pub/Sub subscription, you still have at least 1 VM up, and you pay for it.
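Those boundaries are just pipeline options. A sketch of how they can be set in Python (the project, region, bucket and sizing values are examples, not recommendations):

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Example boundaries: Dataflow autoscales between 1 and 5 workers of this size.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                # placeholder
    region="europe-west1",               # placeholder
    temp_location="gs://my-bucket/tmp",  # placeholder
    num_workers=1,        # starting size; also the effective floor for streaming
    max_num_workers=5,    # upper bound: this caps your worst-case cost
    machine_type="n1-standard-2",  # the VM size you are billed for
)
```

With max_num_workers set, the worst-case cost is bounded; the 1-worker floor is why a streaming pipeline never costs 0.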

Upvotes: 2
