What are best practices for using BigQuery from App Engine Standard?

Question

I am using the App Engine Standard Environment (autoscaled), which means a request is cancelled after 10 minutes.

The goal is to query data from BigQuery at regular intervals and, for each record, create a task in the task queue so that the records can be processed in the background.

The instructions at https://cloud.google.com/bigquery/create-simple-app-api say to wait for a job like this:

// Create a job ID so that we can safely retry.
JobId jobId = JobId.of(UUID.randomUUID().toString());
Job queryJob = bigquery.create(JobInfo.newBuilder(queryConfig).setJobId(jobId).build());

// Wait for the query to complete.
queryJob = queryJob.waitFor();

The problem is the 10-minute limit: BigQuery queries run in the background and the result may take some time to become available, so I may not be able to process the response within the same endpoint call.

Is there a way to receive a callback from BigQuery at a URL, when a query is ready?
Is there a smarter way to process data from BigQuery in App Engine Standard?

I know I can configure App Engine to extend the maximum time per request, but that can hardly be the solution.

Jofre · Accepted Answer

The best option is to handle long-running tasks the way BQ does: give the client a job ID and let it poll for that job, returning a 202 while the query has not finished and a 200 with the result once it is ready to be consumed.

The 202 response can also carry a body, so you can report different statuses to clients (e.g. "Queued", "Running", "Processing results", ...).
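A minimal sketch of that status mapping, in Java to match the question. The class name JobStatusResponse, the forState method, the "DONE" marker, and the JSON body shape are assumptions made for illustration; only the 202/200 split and the example status labels come from the answer.

```java
// Hypothetical helper, not part of any Google library: turns the state stored
// for a job into the HTTP status and body the polling endpoint would return.
final class JobStatusResponse {
    final int httpStatus;
    final String body;

    private JobStatusResponse(int httpStatus, String body) {
        this.httpStatus = httpStatus;
        this.body = body;
    }

    // "DONE" is the assumed marker for a finished job; any other state
    // (e.g. "Queued", "Running", "Processing results") means "not ready yet".
    static JobStatusResponse forState(String state, String resultJson) {
        if ("DONE".equals(state)) {
            // Finished: hand the stored result back with a 200.
            return new JobStatusResponse(200, resultJson);
        }
        // Still in progress: 202 plus a small body describing the status.
        return new JobStatusResponse(202, "{\"status\": \"" + state + "\"}");
    }
}
```

A servlet or other handler would look up the stored state for the requested job ID, call forState, and copy httpStatus and body onto the HTTP response.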

On the server side, you start a query and, as soon as BQ returns a job ID, store it in some persistent storage (I would choose Datastore, but it could be Memcache, a Cloud SQL instance, or even a file in GCS).

Then you just need to create a cron job that checks BQ for the status of the unfinished queries, and updates their status in your persistent storage accordingly. Once the BQ job is finished, you can retrieve the results and store them to have them ready when the client checks your service.
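The cron pass described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the answer's implementation: PendingJobPoller, its methods, and the "RUNNING"/"DONE" labels are hypothetical, the in-memory map stands in for Datastore or whatever store you choose, and the Predicate stands in for the actual status check against BigQuery.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical sketch of one cron-driven polling pass over unfinished jobs.
final class PendingJobPoller {
    // Stand-in for persistent storage: job ID -> "RUNNING" or "DONE".
    private final Map<String, String> store = new LinkedHashMap<>();
    // Stand-in for asking BigQuery whether the job with this ID has finished.
    private final Predicate<String> isDone;

    PendingJobPoller(Predicate<String> isDone) {
        this.isDone = isDone;
    }

    // Called right after the query is submitted and a job ID is known.
    void register(String jobId) {
        store.put(jobId, "RUNNING");
    }

    String stateOf(String jobId) {
        return store.get(jobId);
    }

    // One cron tick: checks every unfinished job and returns the IDs that
    // finished during this pass, so the caller can fetch and store results.
    List<String> pollOnce() {
        List<String> finished = new ArrayList<>();
        for (Map.Entry<String, String> entry : store.entrySet()) {
            if ("RUNNING".equals(entry.getValue()) && isDone.test(entry.getKey())) {
                finished.add(entry.getKey());
            }
        }
        // Update the store after iterating to avoid modifying it mid-loop.
        for (String jobId : finished) {
            store.put(jobId, "DONE");
        }
        return finished;
    }
}
```

With the google-cloud-bigquery client from the question, the predicate could presumably be backed by looking the job up with bigquery.getJob(...) and calling isDone() on the returned Job, with a null check in case the job is not found. Each pass is short, so it stays well inside the request deadline regardless of how long the query itself runs.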

As an example, these are the BQ API calls you should make within your app (shown here with curl; you can later translate them to any language using the idiomatic libraries):

Create the job, retrieve the job id from the response, and store it:

PROJECT=$(gcloud config get-value project)
QUERY='SELECT * FROM `bigquery-samples.wikipedia_benchmark.Wiki1k` limit 0'
curl -H"Authorization: Bearer $(gcloud auth print-access-token)" -H'content-type:application/json' https://www.googleapis.com/bigquery/v2/projects/$PROJECT/jobs -d"
{
 \"configuration\": {
  \"query\": {
   \"query\": \"$QUERY\",
   \"useLegacySql\": false
  }
 },
 \"jobReference\": {
  \"projectId\": \"$PROJECT\"
 }
}"|jq -r .jobReference.jobId >> running_jobs

Keep polling the BQ API for the job status (this could be your cron job):

for job in $(cat running_jobs); do
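  # Quote the substitution so an empty response does not break the test; -s hides curl's progress output.
  if [ "$(curl -s -H"Authorization: Bearer $(gcloud auth print-access-token)" https://www.googleapis.com/bigquery/v2/projects/$PROJECT/jobs/$job | jq -r .status.state)" = "DONE" ]; then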
    # here your processing part including your callback
    # then remove the job from the list of running jobs
    sed -i "/$job/d" ./running_jobs
  fi
done

You can try this in Cloud Shell.
