Reputation: 14563
I need to download a relatively small table from BigQuery and store it (after some parsing) in a pandas DataFrame.
Here is the relevant sample of my code:
import pandas as pd
from google.cloud import bigquery

client = bigquery.Client(project="project_id")
job_config = bigquery.QueryJobConfig(allow_large_results=True)
query_job = client.query("my sql string", job_config=job_config)
result = query_job.result()  # a bigquery.table.RowIterator
rows = [dict(row) for row in result]
pdf = pd.DataFrame.from_dict(rows)
My problem:
After a few thousand rows are parsed, one of them is too big and I get an exception: google.api_core.exceptions.Forbidden.
So, after a few iterations, I tried to transform my loop into something like:
rows = list()
for _ in range(result.total_rows):
    try:
        rows.append(dict(next(result)))
    except google.api_core.exceptions.Forbidden:
        pass
BUT it doesn't work, since result is a bigquery.table.RowIterator and, despite its name, it's not an iterator... it's an iterable.
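For what it's worth, wrapping it with iter() does give a true iterator, so a loop like the one above at least runs. A minimal sketch, using the same result object as before:

import google.api_core.exceptions

it = iter(result)  # iter() turns the iterable RowIterator into a real iterator
rows = []
for _ in range(result.total_rows):
    try:
        rows.append(dict(next(it)))
    except StopIteration:
        break
    except google.api_core.exceptions.Forbidden:
        # Caveat: a generator that raises is closed afterwards, so the next
        # call to next(it) raises StopIteration -- everything after the
        # failing fetch is lost, not just the one oversized row.
        pass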
So... what do I do now? Is there a way to either:
Upvotes: 0
Views: 207
Reputation: 10222
Did you try paging through query results?
from google.cloud import bigquery
# Construct a BigQuery client object.
client = bigquery.Client()
query = """
SELECT name, SUM(number) as total_people
FROM `bigquery-public-data.usa_names.usa_1910_2013`
GROUP BY name
ORDER BY total_people DESC
"""
query_job = client.query(query) # Make an API request.
query_job.result() # Wait for the query to complete.
# Get the destination table for the query results.
#
# All queries write to a destination table. If a destination table is not
# specified, BigQuery populates it with a reference to a temporary
# anonymous table after the query completes.
destination = query_job.destination
# Get the schema (and other properties) for the destination table.
#
# A schema is useful for converting from BigQuery types to Python types.
destination = client.get_table(destination)
# Download rows.
#
# The client library automatically handles pagination.
print("The query data:")
rows = client.list_rows(destination, max_results=20)
for row in rows:
    print("name={}, count={}".format(row["name"], row["total_people"]))
You can also try to filter out big rows in your query:
WHERE LENGTH(some_field) < 123
or
WHERE LENGTH(CAST(some_field AS BYTES)) < 123
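In context, the filter just goes into the query string (table and field names below are placeholders):

query = """
    SELECT *
    FROM `project_id.dataset.my_table`
    WHERE LENGTH(CAST(some_field AS BYTES)) < 123
"""
query_job = client.query(query)  # oversized rows never reach the client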
Upvotes: 1