Reputation: 2660
I'm working in BigQuery on Google Analytics data. At various points in developing the query I get the error: "Resources exceeded". I want to further my understanding of what's happening. I've successfully worked around the problem, but only via trial and error.
When I use the explain tool it seems to be the 'compute' part of any query or sub-query that looks to have exceeded resources.
Here's an example of a standard SQL query that succeeds/fails depending on whether certain parts are left in:
SELECT
fullVisitorId,
visitId,
h.type AS type,
h.hitNumber AS hitNumber,
h.eventInfo.eventAction AS action,
LOWER(h.eventInfo.eventCategory) AS category,
h.page.pagePath AS page,
h.page.pageTitle AS landingTitle,
h.page.searchKeyword AS searchTerm,
LEAD(h.page.pagePath) OVER (PARTITION BY fullVisitorId, visitId ORDER BY h.hitNumber ASC) AS landingPage,
SPLIT(h.eventInfo.eventLabel, ':')[OFFSET(0)] AS clickTitle,
CASE WHEN LEAD(h.page.pageTitle) OVER (PARTITION BY fullVisitorId, visitId ORDER BY h.hitNumber ASC) = SPLIT(h.eventInfo.eventLabel, ':')[OFFSET(0)] THEN true ELSE false END AS searchClick
FROM `project.dataset.ga_sessions_*` AS main, UNNEST(hits) AS h
WHERE _TABLE_SUFFIX BETWEEN '20170401' AND '20170430'
AND (
(
h.eventInfo.eventAction = 'click' AND LOWER(h.eventInfo.eventCategory) LIKE '/search%'
)
OR type = 'PAGE'
)
ORDER BY
fullVisitorId ASC, visitId ASC, h.hitNumber ASC
When removing any one of these sets of elements the query runs:
ORDER BY
fullVisitorId ASC, visitId ASC, h.hitNumber ASC
Or:
LEAD(h.page.pagePath) OVER (PARTITION BY fullVisitorId, visitId ORDER BY h.hitNumber ASC) AS landingPage,
SPLIT(h.eventInfo.eventLabel, ':')[OFFSET(0)] AS clickTitle,
CASE WHEN LEAD(h.page.pageTitle) OVER (PARTITION BY fullVisitorId, visitId ORDER BY h.hitNumber ASC) = SPLIT(h.eventInfo.eventLabel, ':')[OFFSET(0)] THEN true ELSE false END AS searchClick
Or:
When running on a single date partition the entire query runs.
I would describe my current level of understanding as superficial, I know little of the inner workings of GBQ and how it allocates/permits compute resources. I do know that it performs calculations on separate machines where possible. I've heard these described as shards before.
What do I need to know about GBQ compute resources in order to understand why the above will work/not work?
N.B: I only have Tier 1 access, but that doesn't mean I can't gain increased access if I can justify a need. Obviously I don't want to do this with current level of understanding.
Upvotes: 0
Views: 227
Reputation: 11787
I think the only thing that should be causing a problem in your query is the ORDER BY
operation. As you can see in this answer from Jordan, this operation is not parallelizable. You can also check the docs for some ideas of what causes the Resources Exceeded error.
The rest of the query seems to be fine though. I tested your query against our data and it processed almost 300Gb in 20s:
If you still get the error then maybe you are querying quite a high amount of data. This being the case, you could try breaking the query into smaller dates range, querying for less columns, adding some WHERE
conditions to filter out some rows, changing tier and so on.
Upvotes: 3