xibalba1
xibalba1

Reputation: 544

Returning Results of BigQuery Script to Python Client

As of Fall 2019, BigQuery supports scripting, which is great. What I can't figure out is whether the Python client for BigQuery is capable of utilizing this new functionality yet.

For example, running the the following Python code:

client = bigquery.Client()
QUERY = """
BEGIN
    CREATE OR REPLACE TEMP TABLE t0 AS
        SELECT * FROM my_dataset.my_table WHERE foo < 1;

    SELECT SUM(bar) AS bar_sum FROM t0;

    DROP TABLE IF EXISTS t0;
END;
"""

query_job = client.query(QUERY)
rows = query_job.result()

... returns an google.cloud.bigquery.table._EmptyRowIterator object even though I am able to see the statements in the SQL script have successfully run from BigQuery's web UI.

How do I return the results from SELECT statement in this standard SQL script to the Python client?

Upvotes: 5

Views: 9631

Answers (1)

Guillem Xercavins
Guillem Xercavins

Reputation: 7058

It is supported but you need to take into account the following piece of documentation:

Scripts are executed in BigQuery using jobs.insert, similar to any other query, with the multi-statement script specified as the query text. When a script executes, additional jobs, known as child jobs, are created for each statement in the script. You can enumerate the child jobs of a script by calling jobs.list, passing in the script’s job ID as the parentJobId parameter.

When jobs.getQueryResults is invoked on a script, it will return the query results for the last SELECT, DML, or DDL statement to execute in the script, with no query results if none of the above statements have executed. To obtain the results of all statements in the script, enumerate the child jobs and call jobs.getQueryResults on each of them.

As an example, I modified your script to query a public table: bigquery-public-data.london_bicycles.cycle_stations. This runs three child jobs:

enter image description here

where the last one drops the table and does not return any row:

enter image description here

That's why, if I run the Python file, I get something like <google.cloud.bigquery.table._EmptyRowIterator object at 0x7f440aa33c88>.

What we want is the output result of the middle query:

enter image description here

A quick test is to comment out the DROP statement and then iterate over the row(s) to get the result of sum=6676. So, what if we want the intermediate results? The answer, as in the previously cited docs, is to call jobs.list and pass the script job ID as the parentJobId parameter to get the child job IDs:

for job in client.list_jobs(parent_job=query_job.job_id):
    print("Job ID: {}, Statement Type: {}".format(job.job_id, job.statement_type))

We use the list_jobs method and check ID and statement type:

Job ID: script_job_80e...296_2, Statement Type: DROP_TABLE
Job ID: script_job_9a0...7fd_1, Statement Type: SELECT
Job ID: script_job_113...e13_0, Statement Type: CREATE_TABLE_AS_SELECT

Note that the suffix (0, 1, 2) indicates the execution order but we can add a double check to verify that the job is actually a SELECT statement before retrieving the results:

from google.cloud import bigquery

client = bigquery.Client()
QUERY = """
BEGIN
    CREATE OR REPLACE TEMP TABLE t0 AS
        SELECT name, bikes_count FROM `bigquery-public-data.london_bicycles.cycle_stations` WHERE bikes_count > 10;

    SELECT SUM(bikes_count) AS total_bikes FROM t0;

    DROP TABLE IF EXISTS t0;
END;
"""

query_job = client.query(QUERY)
query_job.result()

for job in client.list_jobs(parent_job=query_job.job_id):  # list all child jobs
    # print("Job ID: {}, Statement Type: {}".format(job.job_id, job.statement_type))
    if job.statement_type == "SELECT":  # print the desired job output only
        rows = job.result()
        for row in rows:
            print("sum={}".format(row["total_bikes"]))

output:

sum=6676

Upvotes: 11

Related Questions