Jiho Choi
Jiho Choi

Reputation: 1311

How to access the DML result of the child job with BigQuery Client?

I want to check the value of dml_affected_rows with BigQuery Client.

I have checked [1], and it works when the query (or DML) contains a single job. However, when the results are written from children's jobs like the below script, the num_dml_affected_rows is void. How can I check the child job's num_dml_affected_rows. I am think something like below.

query_job.children_job[-1].num_dml_affected_rows # (last child dml result)

But couldn't find the right way (or method) to accomplish it from the documents [2, 3].

[1] BigQuery update how to get number of updated rows
[2] BQ REST
[3] BQ Client API (dev)

Sample Script (python3)

sql = """
    INSERT `project.dataset.table` (logdatetime)
    SELECT
        CURRENT_DATETIME() as logdatetime
    ;
    SELECT 1
    ;
    INSERT `project.dataset.table` (logdatetime)
    SELECT
        CURRENT_DATETIME() as logdatetime
    ;
"""
query_job = bigquery.Client().query(sql)
query_job.result()
print(query_job.num_child_jobs)  # 3
print(query_job.num_dml_affected_rows)  # None

p.s. My current solution is to store the rows counts of destination table before and after the DML to check the difference, but this is unnecessary if I am able to directly access DML results of the child job.

Upvotes: 3

Views: 1071

Answers (2)

Anne
Anne

Reputation: 96

This may be a bit late for the original question but in case someone else stumbles across this, here is my solution:

def run_bigquery_query():
    bigquery_client = bigquery.Client()

    sql_query = """ SELECT / INSERT / ... """

    job = bigquery_client.query(sql_query)
    job.result()  # Wait for the whole script to finish

    affected_rows = 0
    for child_job in bigquery_client.list_jobs(parent_job=job.job_id):
        if child_job.num_dml_affected_rows is not None:
            affected_rows += child_job.num_dml_affected_rows

    print("Affected rows: {}".format(affected_rows))

Upvotes: 4

Yun Zhang
Yun Zhang

Reputation: 5503

The way that works in scripting is the system variable @@row_count.

If used in a script and the previous script statement is DML, specifies the number of rows modified, inserted, or deleted, as a result of that DML statement.

In your case, if you do

sql = """
    INSERT `project.dataset.table` (logdatetime)
    SELECT
        CURRENT_DATETIME() as logdatetime
    ;
    SELECT @@row_count -- output how many row inserted by previous DML
    ;
    INSERT `project.dataset.table` (logdatetime)
    SELECT
        CURRENT_DATETIME() as logdatetime
    ;
    SELECT @@row_count -- output how many row inserted by previous DML
    ;
"""
query_job = bigquery.Client().query(sql)
query_job.result()
print(query_job.num_child_jobs)  # you're getting 4 jobs here
print(query_job.num_dml_affected_rows)  # still None

Upvotes: 4

Related Questions