sgiri

Reputation: 711

How to iterate through BigQuery query results and write it to file

I need to query a Google BigQuery table and export the results to a gzipped file. This is my current code. The requirement is that each row of data should be newline (\n) delimited.

import gzip
import json

from google.cloud.bigquery import Client
from google.oauth2.service_account import Credentials


def batch_job_handler(args):

    credentials = Credentials.from_service_account_info(service_account_info)

    client = Client(project=service_account_info.get("project_id"),
                    credentials=credentials)

    query_job = client.query(QUERY_STRING)

    results = query_job.result()  # Result's total_rows is 1300000 records

    with gzip.open("data/query_result.json.gz", "wb") as file:
        data = ""
        for res in results:
            data += json.dumps(dict(list(res.items()))) + "\n"
        file.write(bytes(data, encoding="utf-8"))

The above solution works fine for a small number of results, but it gets too slow when the result has 1300000 records.

Is it because of this line: json.dumps(dict(list(res.items()))) + "\n"? I am constructing a huge string by concatenating each record with a newline.

Since I am running this program in AWS Batch, it is consuming too much time. I need help with iterating over the result and writing it to a file in a faster way, for millions of records.
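
For comparison, a minimal sketch of the write loop that streams each row straight into the gzip file as it is read (reusing query_job from the code above) instead of building one huge string first:

import gzip
import json

results = query_job.result()

# Open the gzip file in text mode and write one JSON line per row,
# so no large intermediate string is ever built in memory.
with gzip.open("data/query_result.json.gz", "wt", encoding="utf-8") as out:
    for res in results:
        out.write(json.dumps(dict(res.items())) + "\n")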

Upvotes: 2

Views: 7702

Answers (2)

Felipe Hoffa

Reputation: 59225

Check out the new BigQuery Storage API for quick reads:

For an example of the API at work, see this project:

It has a number of advantages over the previous export-based read flow that should generally lead to better read performance (a usage sketch follows the list):

  • Direct Streaming

It does not leave any temporary files in Google Cloud Storage. Rows are read directly from BigQuery servers using an Avro wire format.

  • Filtering

The new API allows column and limited predicate filtering to only read the data you are interested in.

  • Column Filtering

Since BigQuery is backed by a columnar datastore, it can efficiently stream data without reading all columns.

  • Predicate Filtering

The Storage API supports limited pushdown of predicate filters. It supports a single comparison to a literal.
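
A rough usage sketch with the Python client for the Storage API (google-cloud-bigquery-storage); the project, dataset, table, and column names are placeholders, and exact calls may vary slightly between client library versions:

from google.cloud import bigquery_storage
from google.cloud.bigquery_storage import types

client = bigquery_storage.BigQueryReadClient()

requested_session = types.ReadSession(
    # Placeholder table path: projects/<project>/datasets/<dataset>/tables/<table>
    table="projects/my-project/datasets/my_dataset/tables/my_table",
    data_format=types.DataFormat.AVRO,
    read_options=types.ReadSession.TableReadOptions(
        selected_fields=["col_a", "col_b"],  # column filtering
        row_restriction="col_a > 100",       # predicate: single comparison to a literal
    ),
)

session = client.create_read_session(
    parent="projects/my-project",
    read_session=requested_session,
    max_stream_count=1,
)

# Rows stream directly from BigQuery servers; no temporary files land in GCS.
# Parsing the Avro blocks requires the fastavro extra of the client library.
reader = client.read_rows(session.streams[0].name)
for row in reader.rows(session):
    print(row)

With max_stream_count greater than 1, each stream can be handed to a separate worker and read in parallel.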

Upvotes: 1

saifuddin778

Reputation: 7287

You should (in most cases) point the output of your BigQuery query to a temp table and export that temp table to a Google Cloud Storage bucket. From that bucket, you can download the files locally. This is the fastest route to having results available locally. Everything else will be painfully slow, especially iterating over the results, as BQ is not designed for that.
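
For illustration, a sketch of this route with the google-cloud-bigquery client (the destination table, bucket name, and query are placeholders): run the query into a destination table, then export that table to GCS as gzipped newline-delimited JSON.

from google.cloud import bigquery

client = bigquery.Client()

# Placeholder destination table for the query results.
dest_table = bigquery.TableReference.from_string("my-project.my_dataset.query_result_tmp")

# Run the query and write its results to the destination table.
query_config = bigquery.QueryJobConfig(destination=dest_table)
client.query("SELECT ...", job_config=query_config).result()

# Export the table to GCS as gzipped newline-delimited JSON.
extract_config = bigquery.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON,
    compression=bigquery.Compression.GZIP,
)
client.extract_table(
    dest_table,
    "gs://my-bucket/query_result-*.json.gz",  # wildcard lets BigQuery shard the output
    job_config=extract_config,
).result()  # wait for the export to finish

The exported files can then be pulled down locally, e.g. with gsutil cp.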

Upvotes: 0
