Reputation: 711
I need to query a Google BigQuery table and export the results to a gzipped file. This is my current code. The requirement is that each row should be newline (\n) delimited.
import gzip
import json

from google.cloud.bigquery import Client
from google.oauth2.service_account import Credentials

def batch_job_handler(args):
    credentials = Credentials.from_service_account_info(service_account_info)
    client = Client(project=service_account_info.get("project_id"),
                    credentials=credentials)

    query_job = client.query(QUERY_STRING)
    results = query_job.result()  # results.total_rows is 1300000 records

    with gzip.open("data/query_result.json.gz", "wb") as file:
        data = ""
        for res in results:
            data += json.dumps(dict(list(res.items()))) + "\n"
        file.write(bytes(data, encoding="utf-8"))
The above solution works perfectly fine for a small number of results, but it gets too slow when the result has 1,300,000 records.
Is it because of this line: json.dumps(dict(list(res.items()))) + "\n"
since I am constructing a huge string by concatenating the records with newlines?
As I am running this program in AWS Batch, it is consuming too much time. I need help iterating over the result and writing it to a file in a faster way for millions of records.
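For reference, this is roughly what the same loop looks like if each row is written as soon as it is serialized, instead of being accumulated into one large string first (a sketch of the same logic, not benchmarked at this scale; opening the gzip file in text mode lets it handle the UTF-8 encoding):

with gzip.open("data/query_result.json.gz", "wt", encoding="utf-8") as file:
    # Write each serialized row immediately; no giant intermediate string.
    for res in results:
        file.write(json.dumps(dict(res.items())) + "\n")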
Upvotes: 2
Views: 7702
Reputation: 59225
Check out the new BigQuery Storage API for quick reads; a short usage sketch follows the list of advantages below.
For an example of the API at work, see this project:
It has a number of advantages over using the previous export-based read flow that should generally lead to better read performance:
- Direct Streaming: It does not leave any temporary files in Google Cloud Storage. Rows are read directly from BigQuery servers using an Avro wire format.
- Filtering: The new API allows column and limited predicate filtering to only read the data you are interested in.
    - Column Filtering: Since BigQuery is backed by a columnar datastore, it can efficiently stream data without reading all columns.
    - Predicate Filtering: The Storage API supports limited pushdown of predicate filters. It supports a single comparison to a literal.
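As a rough sketch, this is how the Storage API can be plugged into the query code from the question (it assumes the google-cloud-bigquery-storage package and pandas are installed; the read-client class name has changed across library versions, so check the version you have):

from google.cloud import bigquery
from google.cloud import bigquery_storage
from google.oauth2.service_account import Credentials

credentials = Credentials.from_service_account_info(service_account_info)
bq_client = bigquery.Client(project=service_account_info.get("project_id"),
                            credentials=credentials)

# Rows are fetched over the Storage API (Avro/Arrow streams) instead of the
# slower REST-based paging used by plain result() iteration.
bqstorage_client = bigquery_storage.BigQueryReadClient(credentials=credentials)

rows = bq_client.query(QUERY_STRING).result()
df = rows.to_dataframe(bqstorage_client=bqstorage_client)

# pandas can write gzipped newline-delimited JSON directly.
df.to_json("data/query_result.json.gz", orient="records", lines=True,
           compression="gzip")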
Upvotes: 1
Reputation: 7287
You should (in most cases) point the output of your BigQuery query to a temp table and export that temp table to a Google Cloud Storage bucket. From that bucket, you can download the files locally. This is the fastest route to have results available locally. Everything else will be painfully slow, especially iterating over the results client-side, as BQ is not designed for that. A rough sketch of this flow is below.
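This sketch reuses credentials and QUERY_STRING from the question; the project, dataset, table, and bucket names are placeholders, and in older library versions QueryJobConfig.destination may need a TableReference rather than a string. Exports larger than 1 GB need a wildcard URI so BigQuery can split the output into multiple files:

from google.cloud import bigquery, storage

client = bigquery.Client(project=service_account_info.get("project_id"),
                         credentials=credentials)

# 1. Write the query results into a destination table instead of iterating them.
dest_table = "my-project.my_dataset.query_results_tmp"  # placeholder table
job_config = bigquery.QueryJobConfig(destination=dest_table)
client.query(QUERY_STRING, job_config=job_config).result()

# 2. Export that table to GCS as gzipped newline-delimited JSON.
extract_config = bigquery.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON,
    compression=bigquery.Compression.GZIP,
)
client.extract_table(
    dest_table,
    "gs://my-bucket/query_result-*.json.gz",  # wildcard: output may be sharded
    job_config=extract_config,
).result()

# 3. Download the exported shards locally.
storage_client = storage.Client(project=service_account_info.get("project_id"),
                                credentials=credentials)
for blob in storage_client.list_blobs("my-bucket", prefix="query_result-"):
    blob.download_to_filename(f"data/{blob.name}")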
Upvotes: 0