Mehdi Benmoha

Reputation: 3935

What is the best place to run BigQuery queries in Google Cloud Platform?

I have some files containing thousands of rows that I need to insert into Google BigQuery. Because the execution time exceeds the 60-second request limit in App Engine, I moved the BQ queries to a task queue.

For now, it works very well, but I don't know if this is the best place to put BQ queries. I say this because the requests take up to 3 minutes to complete, which seems a bit slow. Do you think there's a faster / better place to query BQ?

PS: I am using the Google BigQuery API to send the queries.
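
For reference, the setup described above might look roughly like the sketch below, assuming the App Engine standard (Python) task queue API; the handler URL and queue name are hypothetical placeholders, not part of the question.

```python
# Rough sketch of deferring the BigQuery work to a push task queue so it runs
# under the task queue deadline instead of the 60 s request limit.
# The /bq-worker URL and the 'bq-load' queue are hypothetical names.
from google.appengine.api import taskqueue

def enqueue_bq_insert(gcs_file):
    taskqueue.add(
        url='/bq-worker',             # handler that reads the file and calls the BigQuery API
        params={'file': gcs_file},
        queue_name='bq-load',
    )
```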

Upvotes: 0

Views: 383

Answers (3)

J.L Valtueña

Reputation: 403

If you have your text files in Google Cloud Storage, Cloud Dataflow could be a natural solution for your situation {1}.

You can use a Google-Provided Template to save some time in the process of creating a Cloud Dataflow pipeline {2}. This way you can create a batch pipeline to move (and transform if you want) data from Google Cloud Storage (files) to BigQuery.

{1}: https://cloud.google.com/dataflow/

{2}: https://cloud.google.com/dataflow/docs/templates/provided-templates#cloud-storage-text-to-bigquery
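A minimal sketch of launching that template programmatically, assuming the Dataflow REST API via the googleapiclient library; the project, bucket, table and transform-file paths are placeholders, and the parameter names follow the template documentation linked in {2}:

```python
# Launch the Cloud Storage Text to BigQuery template as a batch Dataflow job.
# All project, bucket and table names below are placeholders.
from googleapiclient.discovery import build

dataflow = build('dataflow', 'v1b3')
request = dataflow.projects().templates().launch(
    projectId='my-project',
    gcsPath='gs://dataflow-templates/latest/GCS_Text_to_BigQuery',
    body={
        'jobName': 'gcs-text-to-bq',
        'parameters': {
            'inputFilePattern': 'gs://my-bucket/input/*.txt',
            'JSONPath': 'gs://my-bucket/schema.json',
            'javascriptTextTransformGcsPath': 'gs://my-bucket/transform.js',
            'javascriptTextTransformFunctionName': 'transform',
            'outputTable': 'my-project:my_dataset.my_table',
            'bigQueryLoadingTemporaryDirectory': 'gs://my-bucket/tmp',
        },
    },
)
response = request.execute()
print(response)
```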

Upvotes: 0

Mikhail Berlyant

Reputation: 172974

Also check out Potens.io (available on Cloud Launcher).

Magnus - the Workflow Automator that is part of the Potens suite - supports all of BigQuery, Cloud Storage and most Google APIs, as well as multiple simple utility-type tasks like the BigQuery Task, Export to Storage Task, Loop Task and many more.

Disclosure: I am the creator of these tools and lead the Potens team.

Upvotes: 0

Alexey Maloletkin

Reputation: 1099

There are two options:

  1. Your file is formatted so it can be used with a BQ load job. In this case, you start the load job from a task queue task, store the job ID you get back from the REST call in Datastore, and exit the task. As a separate process, you set up an App Engine cron job that runs, say, every minute, checks all running job IDs, updates their status if it has changed, and initiates the next process if needed (the cron handler itself runs as a task queue task, so it stays under the 10-minute limit). I think this will be pretty scalable (see the first sketch below).

  2. You process the file and insert the rows manually. In this case, the best course of action is to use Pub/Sub, or again to start multiple tasks in the task queue by splitting the data into small pieces and using the BQ streaming insert API. It depends on the size of your rows, of course, but I found that 1000-5000 records per process works well here (see the second sketch below).
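
A minimal sketch of option 1, assuming the google-cloud-bigquery client library; the bucket, table and the Datastore helper functions are hypothetical placeholders:

```python
# Option 1: kick off a BigQuery load job from a task, remember the job ID,
# and poll it later from a cron-triggered task.
from google.cloud import bigquery

client = bigquery.Client()

def start_load_job():
    # Runs inside a task queue task: start the load job and return immediately.
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        autodetect=True,
    )
    load_job = client.load_table_from_uri(
        'gs://my-bucket/data/rows-*.csv',
        'my_dataset.my_table',
        job_config=job_config,
    )
    save_job_id_to_datastore(load_job.job_id)  # hypothetical helper

def check_running_jobs():
    # Runs from the App Engine cron handler (itself enqueued as a task):
    # poll each stored job ID and react when a job finishes.
    for job_id in load_job_ids_from_datastore():  # hypothetical helper
        job = client.get_job(job_id)
        if job.state == 'DONE':
            handle_finished_job(job)  # hypothetical: mark done, start next step
```

And a minimal sketch of option 2 using the streaming insert API; the chunk size and table name are again placeholders:

```python
# Option 2: split the parsed rows into small chunks (the answer suggests
# 1000-5000 records per process) and stream each chunk into BigQuery.
from google.cloud import bigquery

client = bigquery.Client()

def stream_rows(rows, chunk_size=2000):
    for start in range(0, len(rows), chunk_size):
        chunk = rows[start:start + chunk_size]
        errors = client.insert_rows_json('my_dataset.my_table', chunk)
        if errors:
            raise RuntimeError('Streaming insert failed: %s' % errors)
```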

Upvotes: 1
