goose

Reputation: 2652

Query for selecting sequence of hits consumes large quantity of data

I'm trying to measure the conversion rate through alternative funnels on a website. My query has been designed to output a count of sessions that viewed the relevant start URL and a count of sessions that hit the confirmation page strictly in that order. It does this by comparing the times of the hits.

My query appears to return accurate figures, but in doing so selects a massive quantity of data, just under 23GB for what I've attempted to limit to one hour of one day. I don't seem to have written my query in a particularly efficient way and gather that I'll use up all of my company's data quota fairly quickly if I continue to use it.

Here's the offending query in full:

WITH 
s1 AS (
  SELECT
      fullVisitorId,
      visitId,    
      LOWER(h.page.pagePath) AS path,
      device.deviceCategory AS platform,
      MIN(h.time) AS s1_time
  FROM
    `project.dataset.ga_sessions_*`, UNNEST(hits) AS h 
  WHERE
    _TABLE_SUFFIX BETWEEN '20170107' AND '20170107'
  AND
    (LOWER(h.page.pagePath) LIKE '{funnel-start-url-1}%' OR LOWER(h.page.pagePath) LIKE '{funnel-start-url-2}%')
  AND
    totals.visits = 1
  AND
    h.hour  < 21
  AND
    h.hour >= 20      
  AND
    h.type = "PAGE"
  GROUP BY
    path,
    platform,
    fullVisitorId,
    visitId
  ORDER BY
    fullVisitorId ASC, visitId ASC
),

confirmations AS (
  SELECT
      fullVisitorId,
      visitId,
      MIN(h.time) AS confirmation_time
  FROM
    `project.dataset.ga_sessions_*`, UNNEST(hits) AS h 
  WHERE
    _TABLE_SUFFIX BETWEEN '20170107' AND '20170107'
  AND
    h.type = "PAGE"      
  AND
    (LOWER(h.page.pagePath) LIKE '{confirmation-url-1}%' OR LOWER(h.page.pagePath) LIKE '{confirmation-url-2}%')
  AND
    totals.visits = 1
  AND
    h.hour  < 21
  AND
    h.hour >= 20
  GROUP BY
    fullVisitorId,
    visitId
)

SELECT
  platform,
  path,
  COUNT(path) AS Views,
  SUM(
    CASE 
      WHEN s1.s1_time < confirmations.confirmation_time 
      THEN 1 
      ELSE 0 
    END
  ) AS SubsequentPurchases
FROM
  s1
LEFT JOIN
  confirmations
ON
  s1.fullVisitorId = confirmations.fullVisitorId
AND
  s1.visitId = confirmations.visitId
GROUP BY
  platform,
  path

What is it about this query that means it has to process so much data? Is there a better way to get at these numbers? Ideally, any method should be able to measure the multiple different routes, but I'd settle for sustainability at this point.

Upvotes: 1

Views: 472

Answers (1)

Willian Fuks

Reputation: 11787

There are probably a few ways you can optimize your query, but it seems that won't entirely solve your issue (as I'll try to explain below).

As for the query, this one produces the same result but avoids re-selecting the data and the LEFT JOIN operation:

SELECT
    path,
    platform,
    COUNT(path) AS views,
    COUNT(CASE WHEN last_hn > first_hn THEN 1 END) AS SubsequentPurchases
FROM (
  SELECT
      fv,
      v,
      platform,
      path,
      first_hn,
      MAX(last_hn) OVER (PARTITION BY fv, v) AS last_hn
  FROM (
    SELECT
        fullVisitorId AS fv,
        visitId AS v,
        device.deviceCategory AS platform,
        LOWER(hits.page.pagePath) AS path,
        MIN(CASE WHEN REGEXP_CONTAINS(hits.page.pagePath, r'/catalog/|product') THEN hits.hitNumber END) AS first_hn,
        MAX(CASE WHEN REGEXP_CONTAINS(hits.page.pagePath, r'success') THEN hits.hitNumber END) AS last_hn
    FROM
      `project_id.data_set.ga_sessions_20170112`,
      UNNEST(hits) AS hits
    WHERE
      REGEXP_CONTAINS(hits.page.pagePath, r'/catalog/|product|success')
      AND totals.visits = 1
      AND hits.type = 'PAGE'
    GROUP BY
      fv, v, path, platform
  )
)
GROUP BY
  path, platform
HAVING NOT REGEXP_CONTAINS(path, r'success')

first_hn tracks the funnel-start URLs (for which I used the terms "catalog" and "product") and last_hn tracks the confirmation URLs (for which I used the term "success", but you could add more values to the regex). Also, the MIN and MAX aggregations combined with the analytic function give BigQuery some room to optimize the query.
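To see why the analytic MAX is needed, here is a minimal sketch with made-up inline data: within one session, only the confirmation row carries a last_hn, and `MAX(last_hn) OVER (PARTITION BY fv, v)` broadcasts it to every row of that session so the per-path comparison against first_hn becomes possible:

```sql
-- Illustration only: the session 'A'/1 has a start hit (hitNumber 3)
-- and a confirmation hit (hitNumber 9) on different rows.
WITH session_rows AS (
  SELECT 'A' AS fv, 1 AS v, '/catalog/shoes' AS path, 3 AS first_hn, CAST(NULL AS INT64) AS last_hn
  UNION ALL
  SELECT 'A', 1, '/checkout/success', NULL, 9
)
SELECT
  path,
  first_hn,
  -- every row of the session now sees last_hn = 9, so
  -- last_hn > first_hn can be evaluated on the start-URL row
  MAX(last_hn) OVER (PARTITION BY fv, v) AS last_hn
FROM session_rows
```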

There are a few points though to make here:

  1. If you add WHERE hits.hour = 20, BigQuery still has to scan the whole table to determine which hits fall in that hour and which don't. That means the 23 GB you observed still accounts for the whole day.
  2. For comparison, I tested your query against our ga_sessions tables and it took around 31 days of data to reach 23 GB. Since you are not selecting that many fields, it shouldn't be easy to reach this amount unless you have a considerably high traffic volume coming from your data source.
  3. Given current BigQuery pricing, 23 GB would cost you roughly $0.11 to process, which is quite cost-efficient.
  4. Another possibility I could imagine is that you are running this query several times a day without caching or a proper architecture for these operations.

All this being said, you can optimize your query, but I suspect it won't change that much in the end, as it seems you have quite a high volume of data. Processing 23 GB a few times shouldn't be a problem, but if you are concerned that it will reach your quota, then it sounds like you are running this query several times a day.

This being the case, see if relying on the query cache or saving the results into another table and querying that instead will help. Also, you could start saving daily tables with just the sessions you are interested in (those containing the URL patterns you are looking for) and then run your final query against these newly created tables, which would allow you to query over a bigger range of days while spending much less.
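The daily-table idea could be sketched like this in current Standard SQL DDL (at the time of writing you would use a query destination table instead of CREATE TABLE; all project, dataset, and regex values here are placeholders to adapt):

```sql
-- Materialize a slimmed-down daily table holding only page hits on the
-- funnel URLs, so later funnel queries scan this small table instead of
-- the full ga_sessions_* export.
CREATE TABLE `project.dataset.funnel_sessions_20170107` AS
SELECT
  fullVisitorId,
  visitId,
  device.deviceCategory AS platform,
  ARRAY(
    SELECT AS STRUCT
      h.hitNumber,
      h.time,
      LOWER(h.page.pagePath) AS path
    FROM UNNEST(hits) AS h
    WHERE h.type = 'PAGE'
      AND REGEXP_CONTAINS(LOWER(h.page.pagePath), r'funnel-start|confirmation')
  ) AS funnel_hits
FROM `project.dataset.ga_sessions_20170107`
WHERE totals.visits = 1
```

A query over a month of these per-day tables then processes only the handful of columns and rows kept here, rather than the full export.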

Upvotes: 1
