AnnaR
AnnaR

Reputation: 361

BigQuery - Resources exceeded during query execution, with Allow Large Results set

I've been trying to run this query and I keep getting the resources exceeded error despite setting Allow Large results to true and setting the destination table. I tried adding a limit as well. Is there any way I can optimize the query to avoid this error?

SELECT
  repo_name,
  commit,
  parent,
  subject,
  message,
  difference.*
FROM
  FLATTEN(FLATTEN([bigquery-public-data:github_repos.commits], repo_name), parent)
WHERE
  (REGEXP_MATCH(message,r'(?i:null pointer dereference)'))
LIMIT
  5000000

Thank you very much!

Upvotes: 1

Views: 193

Answers (1)

Elliott Brossard
Elliott Brossard

Reputation: 33705

It worked for me (and took about a minute and a half):

#standardSQL
SELECT
  repo_name,
  commit,
  parent,
  subject,
  message,
  difference
FROM
  `bigquery-public-data.github_repos.commits`
CROSS JOIN UNNEST(repo_name) AS repo_name
CROSS JOIN UNNEST(parent) AS parent
WHERE
  REGEXP_CONTAINS(message, r'(?i:null pointer dereference)');

I suspect that legacy SQL in BigQuery is worse at handling this kind of thing. As an aside, though: is there a reason that you want to flatten the repo_name and parent arrays? Without flattening, the query would produce 37,073 rows, whereas with flattening, it produces 11,419,166. Depending on the kind of analysis that you want to do, you may not need to flatten the data at all.

Edit: since it sounds like flattening was only necessary to work around legacy SQL's limitations related to independently repeating fields, you can remove the CROSS JOINs and avoid flattening:

#standardSQL
SELECT
  repo_name,
  commit,
  parent,
  subject,
  message,
  difference
FROM
  `bigquery-public-data.github_repos.commits`
WHERE
  REGEXP_CONTAINS(message, r'(?i:null pointer dereference)');

This is faster, too--it takes about 15 seconds instead of 90.

Upvotes: 2

Related Questions