Michael B.

Reputation: 995

BigQuery Internal Error with `pageToken` when running in GCP

I run into this error with BigQuery:

"An internal error occurred and the request could not be completed. This is usually caused by a transient issue. Retrying the job with back-off as described in the BigQuery SLA should solve the problem: https://cloud.google.com/bigquery/sla. If the error continues to occur please contact support at https://cloud.google.com/support. Error: 5034996"

Two applications use the same approach with pageToken to paginate through big result sets (a sketch follows the steps):

  1. run query with initial startIndex: 0, maxResults: 10
  2. get result together with pageToken
  3. send to client
  4. ... some time may pass ...
  5. request "next page": use pageToken together with maxResults: 10 to get the next result
  6. repeat from 3.
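Sketch of this flow with @google-cloud/bigquery (the function names and the `autoPaginate: false` option are my illustration, not the exact application code):

import { BigQuery, Job } from '@google-cloud/bigquery';

const bq = new BigQuery();

// steps 1+2: run the query, fetch the first page and its pageToken
async function firstPage(sql: string) {
    const [job] = await bq.createQueryJob({ query: sql });
    const [rows, , meta] = await job.getQueryResults({
        startIndex: '0',
        maxResults: 10,
        autoPaginate: false, // assumption: return one page plus pageToken instead of all rows
    });
    return { job, rows, pageToken: meta?.pageToken };
}

// step 5: some time later, fetch the next page using only pageToken + maxResults
async function nextPage(job: Job, pageToken: string) {
    const [rows, , meta] = await job.getQueryResults({
        maxResults: 10,
        pageToken,
        autoPaginate: false,
    });
    return { rows, pageToken: meta?.pageToken };
}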

NodeJS 16, @google-cloud/bigquery 6.0.3

Locally (Windows 10), everything works for both applications: pagination with pageToken returns results quite fast (<5s). Steps 1 to 6 work, including requesting multiple next pages one after another; I even tested that the pageToken still works after 60+ minutes.

Production has problems: the initial query always works, but as soon as a pageToken is given, the query fails after ~15s, even when the next page is requested directly (1-5s delay) after getting the first page. Steps 1 to 3 work, but requesting the next page fails almost every time; it is very rare that it doesn't.

Production uses Google Cloud Functions and Google Cloud Run to serve the applications.

One application is an internal experiment; it uses the same dataset + table when running locally and when running in "production".

The other application uses the same dataset but different tables for local/production, and lives in a different Google Cloud project than the first application.

Thus project-level quotas or, e.g., different local/production table setups shouldn't cause the issue here (hopefully).

Example code used:

const [rows, , jobQueryResults] = await (job.getQueryResults(
    ('maxResults' in paginate ?
        {
            // there seems to be no need to pass startIndex again (also tested with a stable `0` index);
            // currently startIndex is omitted as soon as a pageToken is given
            startIndex: paginate.pageToken ? undefined : paginate.startIndex,
            maxResults: paginate.maxResults,
            pageToken: paginate.pageToken,
        } : undefined) as QueryResultsOptions,
) as Promise<QueryRowsResponse>)

What puzzles me is that the pageToken isn't shown in the log of the failure, while maxResults is visible:

(screenshot: Google Cloud log of the BigQuery failure)

Edit

The error suggests an SLA problem. One of the GCP projects only contains experimental (non-public) applications, so any traffic/usage can easily be monitored.

The BigQuery monitoring in that project shows roughly 1 job per second while testing: jobs 1 and 2 were loads without a pageToken, job 3 used the pageToken from job 2 and ran into the error. The retries must happen on the BigQuery side; nothing is implemented on my side (using only the official BigQuery package).

(screenshot: BigQuery monitoring of the project)

Upvotes: 0

Views: 646

Answers (1)

jrfarrant

Reputation: 1

We experienced this issue when the pageToken from a prior BigQuery job was used in the getQueryResults() method for a successive job, AND the underlying table had been modified between the prior job and the successive job. Under these conditions, the pageToken is not valid and BigQuery throws a generic error when executing the getQueryResults() method.

The solution that worked for us was to avoid running multiple jobs when paginating, and instead use the .job() method to retrieve the original job, then call the getQueryResults() method with the pageToken:

// re-attach to the original query job by its ID and location, then
// page through that job's results instead of starting a new job
const [rows] = await bqClient
  .job(jobId, {
    location: jobLocation,
  })
  .getQueryResults({
    maxResults,
    pageToken,
  });
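For this to work across requests, the job ID and location from the initial query have to be persisted and sent back by the client; a short sketch (the response shape and `autoPaginate: false` are illustrative, not our exact code):

// initial request: run the query once and keep the job's ID and location
const [job] = await bqClient.createQueryJob({ query: sql });
const [rows, , meta] = await job.getQueryResults({ maxResults, autoPaginate: false });
// reply with { rows, jobId: job.id, jobLocation: job.location, pageToken: meta?.pageToken }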

In our case, the issue was not immediately apparent, because BigQuery will accept the pageToken from a prior job when the two jobs return exactly matching results. That tends to be true in a relatively quiet development environment, but not in a production environment with continuous updates via streaming inserts.

Upvotes: 0
