goose
goose

Reputation: 2660

Understanding what leads to "resources exceeded" error in GBQ?

I'm working in BigQuery on Google Analytics data. At various points in developing the query I get the error: "Resources exceeded". I want to further my understanding of what's happening. I've successfully worked around the problem, but only via trial and error.

When I use the explain tool it seems to be the 'compute' part of any query or sub-query that looks to have exceeded resources.

Here's an example of a standard SQL query that succeeds/fails depending on whether certain parts are left in:

  SELECT
    fullVisitorId,
    visitId,
    h.type AS type,
    h.hitNumber AS hitNumber,
    h.eventInfo.eventAction AS action,
    LOWER(h.eventInfo.eventCategory) AS category,
    h.page.pagePath AS page,
    h.page.pageTitle AS landingTitle,
    h.page.searchKeyword AS searchTerm,
    LEAD(h.page.pagePath) OVER (PARTITION BY fullVisitorId, visitId ORDER BY h.hitNumber ASC) AS landingPage,
    SPLIT(h.eventInfo.eventLabel, ':')[OFFSET(0)] AS clickTitle,
    CASE WHEN LEAD(h.page.pageTitle) OVER (PARTITION BY fullVisitorId, visitId ORDER BY h.hitNumber ASC) = SPLIT(h.eventInfo.eventLabel, ':')[OFFSET(0)] THEN true ELSE false END AS searchClick     
 FROM `project.dataset.ga_sessions_*` AS main, UNNEST(hits) AS h 
  WHERE _TABLE_SUFFIX BETWEEN '20170401' AND '20170430'
AND (
  (
    h.eventInfo.eventAction = 'click' AND LOWER(h.eventInfo.eventCategory) LIKE '/search%'
  )
    OR type = 'PAGE'
)
ORDER BY
  fullVisitorId ASC, visitId ASC, h.hitNumber ASC

When removing any one of these sets of elements the query runs:

ORDER BY
  fullVisitorId ASC, visitId ASC, h.hitNumber ASC

Or:

LEAD(h.page.pagePath) OVER (PARTITION BY fullVisitorId, visitId ORDER BY h.hitNumber ASC) AS landingPage,
SPLIT(h.eventInfo.eventLabel, ':')[OFFSET(0)] AS clickTitle,
CASE WHEN LEAD(h.page.pageTitle) OVER (PARTITION BY fullVisitorId, visitId ORDER BY h.hitNumber ASC) = SPLIT(h.eventInfo.eventLabel, ':')[OFFSET(0)] THEN true ELSE false END AS searchClick

Or:

When running on a single date partition the entire query runs.

I would describe my current level of understanding as superficial, I know little of the inner workings of GBQ and how it allocates/permits compute resources. I do know that it performs calculations on separate machines where possible. I've heard these described as shards before.

What do I need to know about GBQ compute resources in order to understand why the above will work/not work?

N.B: I only have Tier 1 access, but that doesn't mean I can't gain increased access if I can justify a need. Obviously I don't want to do this with current level of understanding.

Upvotes: 0

Views: 227

Answers (1)

Willian Fuks
Willian Fuks

Reputation: 11787

I think the only thing that should be causing a problem in your query is the ORDER BY operation. As you can see in this answer from Jordan, this operation is not parallelizable. You can also check the docs for some ideas of what causes the Resources Exceeded error.

The rest of the query seems to be fine though. I tested your query against our data and it processed almost 300Gb in 20s:

enter image description here

If you still get the error then maybe you are querying quite a high amount of data. This being the case, you could try breaking the query into smaller dates range, querying for less columns, adding some WHERE conditions to filter out some rows, changing tier and so on.

Upvotes: 3

Related Questions