Reputation: 789
Consider the following script (adapted from the Google Cloud Python documentation: https://google-cloud-python.readthedocs.io/en/0.32.0/bigquery/usage.html#querying-data), which runs a BigQuery query with a timeout of 30 seconds:
import logging
from google.cloud import bigquery
# Set logging level to DEBUG in order to see the HTTP requests
# being made by urllib3
logging.basicConfig(level=logging.DEBUG)
PROJECT_ID = "project_id" # replace by actual project ID
client = bigquery.Client(project=PROJECT_ID)
QUERY = ('SELECT name FROM `bigquery-public-data.usa_names.usa_1910_2013` '
'WHERE state = "TX" '
'LIMIT 100')
TIMEOUT = 30 # in seconds
query_job = client.query(QUERY) # API request - starts the query
assert query_job.state == 'RUNNING'
# Waits for the query to finish
iterator = query_job.result(timeout=TIMEOUT)
rows = list(iterator)
assert query_job.state == 'DONE'
assert len(rows) == 100
row = rows[0]
assert row[0] == row.name == row['name']
The linked documentation says:
Use of the timeout parameter is optional. The query will continue to run in the background even if it takes longer the timeout allowed.
When I run it with google-cloud-bigquery version 1.23.1, the logging output seems to indicate that "timeoutMs" is 10 seconds:
DEBUG:urllib3.connectionpool:https://bigquery.googleapis.com:443 "GET /bigquery/v2/projects/project_id/queries/5ceceaeb-e17c-4a86-8a27-574ad561b856?maxResults=0&timeoutMs=10000&location=US HTTP/1.1" 200 None
Notice the timeoutMs=10000 in the output above.
This seems to happen whenever I call result with a timeout value higher than 10. On the other hand, if I use a value lower than 10 as the timeout, the timeoutMs value looks correct. For example, if I change TIMEOUT = 30 to TIMEOUT = 5 in the script above, the log shows:
DEBUG:urllib3.connectionpool:https://bigquery.googleapis.com:443 "GET /bigquery/v2/projects/project_id/queries/71a28435-cbcb-4d73-b932-22e58e20d994?maxResults=0&timeoutMs=4900&location=US HTTP/1.1" 200 None
Is this behavior expected?
Thank you in advance and best regards.
Upvotes: 2
Views: 6305
Reputation: 2850
The timeout parameter is applied on a best-effort basis: the method tries to complete all of its API calls within the indicated time frame. Internally, the result() method can perform more than one request, and the getQueryResults request in the log:
DEBUG:urllib3.connectionpool:https://bigquery.googleapis.com:443 "GET /bigquery/v2/projects/project_id/queries/5ceceaeb-e17c-4a86-8a27-574ad561b856?maxResults=0&timeoutMs=10000&location=US HTTP/1.1" 200 None
is executed inside the done() method. You can see the source code to understand how the timeout for each request is calculated, but basically it is the minimum of 10 seconds and the user-supplied timeout. If the operation has not completed, the request is retried until the overall timeout has been reached.
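As a rough illustration of that calculation, here is a minimal sketch rather than the library's actual code. The function name per_request_timeout_ms and its parameters are hypothetical, and the 0.1 second buffer is an assumption inferred from the timeoutMs=4900 value in the question's log for a 5 second timeout.

```python
def per_request_timeout_ms(remaining_secs, buffer_secs=0.1, cap_secs=10.0):
    """Return the timeoutMs value to send with one getQueryResults request.

    remaining_secs: what is left of the user-supplied timeout, in seconds.
    buffer_secs:    assumed safety margin for network latency (hypothetical,
                    inferred from the 4900 ms seen in the log).
    cap_secs:       upper bound per request (the 10 seconds seen in the log).
    """
    # Leave a small buffer, cap at 10 seconds, and never go below zero.
    api_timeout = max(min(remaining_secs - buffer_secs, cap_secs), 0.0)
    return int(round(api_timeout * 1000))


# A 30 second timeout is capped at 10 seconds per request.
print(per_request_timeout_ms(30))  # 10000
# A 5 second timeout stays below the cap, minus the assumed buffer.
print(per_request_timeout_ms(5))   # 4900
```

Under these assumptions, timeoutMs=10000 does not mean the 30 second timeout is ignored; it only bounds a single polling request, and further requests are made while the overall timeout has not yet elapsed.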
Upvotes: 1