Reputation: 544
As of Fall 2019, BigQuery supports scripting, which is great. What I can't figure out is whether the Python client for BigQuery is capable of utilizing this new functionality yet.
For example, running the the following Python code:
client = bigquery.Client()
QUERY = """
BEGIN
CREATE OR REPLACE TEMP TABLE t0 AS
SELECT * FROM my_dataset.my_table WHERE foo < 1;
SELECT SUM(bar) AS bar_sum FROM t0;
DROP TABLE IF EXISTS t0;
END;
"""
query_job = client.query(QUERY)
rows = query_job.result()
... returns an google.cloud.bigquery.table._EmptyRowIterator
object even though I am able to see the statements in the SQL script have successfully run from BigQuery's web UI.
How do I return the results from SELECT statement in this standard SQL script to the Python client?
Upvotes: 5
Views: 9631
Reputation: 7058
It is supported but you need to take into account the following piece of documentation:
Scripts are executed in BigQuery using jobs.insert, similar to any other query, with the multi-statement script specified as the query text. When a script executes, additional jobs, known as child jobs, are created for each statement in the script. You can enumerate the child jobs of a script by calling jobs.list, passing in the script’s job ID as the parentJobId parameter.
When jobs.getQueryResults is invoked on a script, it will return the query results for the last SELECT, DML, or DDL statement to execute in the script, with no query results if none of the above statements have executed. To obtain the results of all statements in the script, enumerate the child jobs and call jobs.getQueryResults on each of them.
As an example, I modified your script to query a public table: bigquery-public-data.london_bicycles.cycle_stations
. This runs three child jobs:
where the last one drops the table and does not return any row:
That's why, if I run the Python file, I get something like <google.cloud.bigquery.table._EmptyRowIterator object at 0x7f440aa33c88>
.
What we want is the output result of the middle query:
A quick test is to comment out the DROP
statement and then iterate over the row(s) to get the result of sum=6676
. So, what if we want the intermediate results? The answer, as in the previously cited docs, is to call jobs.list
and pass the script job ID as the parentJobId
parameter to get the child job IDs:
for job in client.list_jobs(parent_job=query_job.job_id):
print("Job ID: {}, Statement Type: {}".format(job.job_id, job.statement_type))
We use the list_jobs
method and check ID and statement type:
Job ID: script_job_80e...296_2, Statement Type: DROP_TABLE
Job ID: script_job_9a0...7fd_1, Statement Type: SELECT
Job ID: script_job_113...e13_0, Statement Type: CREATE_TABLE_AS_SELECT
Note that the suffix (0, 1, 2) indicates the execution order but we can add a double check to verify that the job is actually a SELECT
statement before retrieving the results:
from google.cloud import bigquery
client = bigquery.Client()
QUERY = """
BEGIN
CREATE OR REPLACE TEMP TABLE t0 AS
SELECT name, bikes_count FROM `bigquery-public-data.london_bicycles.cycle_stations` WHERE bikes_count > 10;
SELECT SUM(bikes_count) AS total_bikes FROM t0;
DROP TABLE IF EXISTS t0;
END;
"""
query_job = client.query(QUERY)
query_job.result()
for job in client.list_jobs(parent_job=query_job.job_id): # list all child jobs
# print("Job ID: {}, Statement Type: {}".format(job.job_id, job.statement_type))
if job.statement_type == "SELECT": # print the desired job output only
rows = job.result()
for row in rows:
print("sum={}".format(row["total_bikes"]))
output:
sum=6676
Upvotes: 11