Reputation: 361
I've been trying to run this query and I keep getting the resources exceeded error despite setting Allow Large results to true and setting the destination table. I tried adding a limit as well. Is there any way I can optimize the query to avoid this error?
SELECT
repo_name,
commit,
parent,
subject,
message,
difference.*
FROM
FLATTEN(FLATTEN([bigquery-public-data:github_repos.commits], repo_name), parent)
WHERE
(REGEXP_MATCH(message,r'(?i:null pointer dereference)'))
LIMIT
5000000
Thank you very much!
Upvotes: 1
Views: 193
Reputation: 33705
It worked for me (and took about a minute and a half):
#standardSQL
SELECT
repo_name,
commit,
parent,
subject,
message,
difference
FROM
`bigquery-public-data.github_repos.commits`
CROSS JOIN UNNEST(repo_name) AS repo_name
CROSS JOIN UNNEST(parent) AS parent
WHERE
REGEXP_CONTAINS(message, r'(?i:null pointer dereference)');
I suspect that legacy SQL in BigQuery is worse at handling this kind of thing. As an aside, though: is there a reason that you want to flatten the repo_name
and parent
arrays? Without flattening, the query would produce 37,073 rows, whereas with flattening, it produces 11,419,166. Depending on the kind of analysis that you want to do, you may not need to flatten the data at all.
Edit: since it sounds like flattening was only necessary to work around legacy SQL's limitations related to independently repeating fields, you can remove the CROSS JOIN
s and avoid flattening:
#standardSQL
SELECT
repo_name,
commit,
parent,
subject,
message,
difference
FROM
`bigquery-public-data.github_repos.commits`
WHERE
REGEXP_CONTAINS(message, r'(?i:null pointer dereference)');
This is faster, too--it takes about 15 seconds instead of 90.
Upvotes: 2