Vibhor Jain
Vibhor Jain

Reputation: 1476

(BigQuery PY Client Library v0.28) - Fetch result from table 'query' job

I'm learning BigQuery API using Python Client Libraries v0.28 https://googlecloudplatform.github.io/google-cloud-python/latest/bigquery/usage.html#run-a-simple-query

Wrote this simple code to fetch data from the table

1) Create client object

client_ = bigquery.Client.from_service_account_json('/Users/xyz/key.json')

2) Begin new Async query job

QUERY =  'SELECT visitid FROM `1234567.ga_sessions_20180101`'
query_job = client_.query(QUERY
                      , job_id=str(uuid.uuid4()))

3) poll until the query is DONE

while (query_job.state == 'RUNNING'):
    time.sleep(5)
    query_job.reload()

4) Fetch the results in iteration

query_job.reload()        
iter = query_job.result() 

At this stage I'd like to fetch how many rows are in the table. As per the doc GitHub code iter is of type bigquery.table.RowIterator with a property [tier.total_rows][1]

5) However, at this stage when I print:

print(iter.total_rows)

It keeps returning None

I'm pretty sure this table is NOT empty an dry query is correctly formatted!

Any help to any pointers what am I missing here will be really helpful... Thanks a lot!

Cheers!

Upvotes: 4

Views: 1873

Answers (2)

Ben Mares
Ben Mares

Reputation: 2157

The current behavior of how RowIterator returns None is indeed perplexing. Luckily, according to this issue, tswast's comment from 10 days ago indicates that the developers are working on a better solution.

Current awkward behavior of .total_rows

Currently, .total_rows is initialized only once iteration begins. (In what follows, for clarity I renamed your iter variable to row_iter.)

row_iter = query_job.result()
itr = iter(row_iter)
first_row = next(itr)
print(row_iter.total_rows)  # Now you get a number instead of None.

This is ugly because to continue the iteration, we must either handle the first row differently or call row_iter = query_job.result() again.

Temporary workaround

A currently-working alternative is to use the value of query_job._query_results.total_rows. Unfortunately this is cheating because _query_results is private, so there is no reason to expect that this will work in the future.

Future behavior

If tswast's proposal is implemented, then row_iter.total_rows will be initialized at the beginning, just as you expect.

Suggestion

In my code, I'm going to use something like

try:
    num_rows = row_iter.total_rows or query_job._query_results.total_rows
except NameError:
    num_rows = None

to be compatible with future behavior while falling-back to the temporary workaround if necessary.

Upvotes: 0

Daria
Daria

Reputation: 606

You need to also check query_job.error_result to make sure query succeeded.

You can also see your job in the UI, which can be useful for debugging, using project id and job id:

https://bigquery.cloud.google.com/results/projectid:jobid

Also, query_job.result() already waits for the job completion so you don't need to poll.

Upvotes: 2

Related Questions