Chris Alme

Reputation: 81

BigQuery - unable to submit queries via batch API

Our application batches queries and submits via BigQuery's batch API. We've submitted several batches of queries whose jobs have been stuck in a "running" state for over an hour now. All systems are green according to status.cloud.google.com but that does not appear to be the case for us.

Anyone else experiencing similar behavior? FWIW - query submission via the BQ web UI is no longer working for us either, due to exceeding concurrent rate limits (from the aforementioned stuck jobs), so something is woefully wrong...

Upvotes: 0

Views: 618

Answers (1)

Michael Sheldon

Reputation: 2057

You are submitting your queries via the batch API just fine. It looks like you are doing this very quickly and with computationally expensive queries, so they all compete with each other and slow down.

It looks like you submitted about 200 jobs at approximately the same time on the 18th (this happened a few times), and about 25k jobs on the 17th.

These were all submitted at interactive query priority, and almost all of them immediately failed with rate-limit-exceeded errors, leaving about 50 queries running from each set you submitted (50 being roughly the concurrent-query quota limit).

Spot checking a few of these queries: these are computationally expensive queries. Take a look at your query's Billing Tier (https://cloud.google.com/bigquery/pricing#high-compute), which can be found in the jobs.get output here: https://cloud.google.com/bigquery/docs/reference/v2/jobs#statistics.query.billingTier. These queries also appear to be re-computing the same (or at least very similar) intermediate join results.
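
As a spot-check of your own, the google-cloud-bigquery Python client exposes this via jobs.get. Here's a quick sketch (the project ID and job ID are placeholders, not from your account):

    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")  # placeholder project ID

    # get_job wraps jobs.get; billing_tier surfaces
    # statistics.query.billingTier from the job resource.
    job = client.get_job("my-job-id")  # placeholder job ID
    print(job.job_id, job.state, job.billing_tier)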

When you run 50 large queries simultaneously, they will compete with each other for resources and slow down.

There are several issues you may want to look into:

  • You are submitting lots of queries at interactive query priority, which has a pretty strict concurrent rate limit. If you want to run many queries simultaneously, try using batch query priority (see the first sketch after this list). https://cloud.google.com/bigquery/docs/reference/v2/jobs#configuration.query.priority
  • Your query mix looks like it can be optimized. Can you materialize some intermediate results that are common across all your queries with one join operation, and then run lots of smaller queries against those results? (See the second sketch after this list.)
  • If you need to run many computationally expensive queries quickly:
    • You may want to purchase additional slots to increase your query throughput. See https://cloud.google.com/bigquery/pricing#slots.
    • You may want to rate limit yourself on the client side to prevent your computationally expensive queries from competing with each other. Consider running only a few queries at a time, as in the first sketch below. Your overall throughput will likely be faster.
  • You are using the batch insert API. This makes it very efficient to insert multiple queries with one HTTP request. I find that the HTTP connection is rarely the cause of latency with large scale data analysis, so to keep client code simple I prefer to use the regular jobs.insert API and insert jobs one at a time. (This becomes more important when you want to deal with error cases, as doing that with batched inserts is difficult.)
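
If you go the batch-priority route, here is a minimal sketch using the google-cloud-bigquery Python client; the project ID, query list, and worker count are all placeholders, not taken from your jobs. It submits each query at batch priority and caps client-side concurrency so expensive queries don't starve each other:

    from concurrent.futures import ThreadPoolExecutor

    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")  # placeholder project ID

    # BATCH priority lets BigQuery queue jobs when resources are scarce
    # instead of counting them against the interactive concurrency limit.
    config = bigquery.QueryJobConfig(priority=bigquery.QueryPriority.BATCH)

    def run_one(sql):
        job = client.query(sql, job_config=config)  # one jobs.insert per query
        return job.result()                         # block until this job finishes

    queries = ["SELECT 1", "SELECT 2"]  # placeholder query list

    # Keep only a few expensive queries in flight at a time.
    with ThreadPoolExecutor(max_workers=4) as pool:
        for rows in pool.map(run_one, queries):
            pass  # consume each result set here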

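And to sketch the materialization idea (the dataset, table, and SQL below are made up, not your actual schema): write the shared join to a destination table once, then aim the many smaller queries at that table instead of re-running the join:

    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")  # placeholder project ID

    # Materialize the shared intermediate join into a table once...
    dest = bigquery.TableReference.from_string(
        "my-project.my_dataset.joined_intermediate")  # placeholder table
    config = bigquery.QueryJobConfig(destination=dest)
    client.query(
        "SELECT a.key, b.value FROM my_dataset.a AS a "
        "JOIN my_dataset.b AS b ON a.key = b.key",  # stand-in for your common join
        job_config=config,
    ).result()

    # ...then each small query reads the precomputed result.
    client.query(
        "SELECT key, value FROM my_dataset.joined_intermediate").result()
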
Upvotes: 3
