Reputation: 2652
I'm trying to measure the conversion rate through alternative funnels on a website. My query has been designed to output a count of sessions that viewed the relevant start URL and a count of sessions that hit the confirmation page strictly in that order. It does this by comparing the times of the hits.
My query appears to return accurate figures, but in doing so selects a massive quantity of data, just under 23GB for what I've attempted to limit to one hour of one day. I don't seem to have written my query in a particularly efficient way and gather that I'll use up all of my company's data quota fairly quickly if I continue to use it.
Here's the offending query in full:
WITH
s1 AS (
SELECT
fullVisitorId,
visitId,
LOWER(h.page.pagePath) AS path,
device.deviceCategory AS platform,
MIN(h.time) AS s1_time
FROM
`project.dataset.ga_sessions_*`, UNNEST(hits) AS h
WHERE
_TABLE_SUFFIX BETWEEN '20170107' AND '20170107'
AND (LOWER(h.page.pagePath) LIKE '{funnel-start-url-1}%' OR LOWER(h.page.pagePath) LIKE '{funnel-start-url-2}%')
AND totals.visits = 1
AND h.hour < 21
AND h.hour >= 20
AND h.type = "PAGE"
GROUP BY
path,
platform,
fullVisitorId,
visitId
ORDER BY
fullVisitorId ASC, visitId ASC
),
confirmations AS (
SELECT
fullVisitorId,
visitId,
MIN(h.time) AS confirmation_time
FROM
`project.dataset.ga_sessions_*`, UNNEST(hits) AS h
WHERE
_TABLE_SUFFIX BETWEEN '20170107' AND '20170107'
AND h.type = "PAGE"
AND (LOWER(h.page.pagePath) LIKE '{confirmation-url-1}%' OR LOWER(h.page.pagePath) LIKE '{confirmation-url-2}%')
AND totals.visits = 1
AND h.hour < 21
AND h.hour >= 20
GROUP BY
fullVisitorId,
visitId
)
SELECT
platform,
path,
COUNT(path) AS Views,
SUM(
CASE
WHEN s1.s1_time < confirmations.confirmation_time
THEN 1
ELSE 0
END
) AS SubsequentPurchases
FROM
s1
LEFT JOIN
confirmations
ON
s1.fullVisitorId = confirmations.fullVisitorId
AND
s1.visitId = confirmations.visitId
GROUP BY
platform,
path
What is it about this query that means it has to process so much data? Is there a better way to get at these numbers? Ideally any method should be able to measure multiple different routes, but I'd settle for sustainability at this point.
Upvotes: 1
Views: 472
Reputation: 11787
There are probably a few ways to optimize your query, but it seems that alone won't entirely solve your issue (as I'll explain further below).
As for the query, this one does the same thing but avoids re-selecting data and the LEFT JOIN operation:
SELECT
path,
platform,
COUNT(path) views,
COUNT(CASE WHEN last_hn > first_hn THEN 1 END) SubsequentPurchases
FROM (
SELECT
fv,
v,
platform,
path,
first_hn,
MAX(last_hn) OVER(PARTITION BY fv, v) last_hn
FROM (
SELECT
fullvisitorid fv,
visitid v,
device.devicecategory platform,
LOWER(hits.page.pagepath) path,
MIN(CASE WHEN REGEXP_CONTAINS(hits.page.pagepath, r'/catalog/|product') THEN hits.hitnumber ELSE NULL END) first_hn,
MAX(CASE WHEN REGEXP_CONTAINS(hits.page.pagepath, r'success') THEN hits.hitnumber ELSE NULL END) last_hn
FROM `project_id.data_set.ga_sessions_20170112`,
UNNEST(hits) hits
WHERE
REGEXP_CONTAINS(hits.page.pagepath, r'/catalog/|product|success')
AND totals.visits = 1
AND hits.type = 'PAGE'
GROUP BY
fv, v, path, platform
)
)
GROUP BY
path, platform
HAVING NOT REGEXP_CONTAINS(path, r'success')
first_hn tracks the funnel-start URL (for which I used the terms "catalog" and "product") and last_hn tracks the confirmation URLs (for which I used the term "success", but you could add more values to the regex selector). Also, by using the MIN and MAX aggregations together with the analytic function, you give BigQuery some room to optimize the query.
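As a toy illustration of what the analytic function is doing (made-up session data, not from the question): MAX(last_hn) OVER(PARTITION BY fv, v) copies the session's confirmation hit number onto every row of that session, so the outer query can compare first_hn against last_hn without a join:

```sql
-- Hypothetical two-hit session: a catalog view (hit 3) and a success page (hit 9)
WITH sample AS (
  SELECT 'A' AS fv, 1 AS v, '/catalog/x' AS path, 3 AS first_hn, CAST(NULL AS INT64) AS last_hn
  UNION ALL
  SELECT 'A', 1, '/success', NULL, 9
)
SELECT
  fv, v, path, first_hn,
  -- spreads the confirmation hitNumber (9) across both rows of session (A, 1)
  MAX(last_hn) OVER (PARTITION BY fv, v) AS last_hn
FROM sample
```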
There are a few points though to make here:
- Filtering by hour won't reduce the bytes billed. Even with a clause like WHERE hits.hithour = 20, BigQuery still has to scan the whole table to work out which hits fall in hour 20 and which don't. That means the 23GB you observed still accounts for the whole day.
- Consider using cache or some proper architecture for these operations.
All this being said, you can optimize your query, but I suspect it won't change that much in the end, as it seems you have quite a high volume of data. Processing 23GB a few times shouldn't be a problem, but if you are concerned about reaching your quota then presumably you are running this query several times a day.
This being the case, see if using either the cache flag or saving the results into another table and querying that instead will help. Also, you could start saving daily tables with just the sessions you are interested in (those containing the URL patterns you are looking for) and then run your final query against these newly created tables, which would allow you to query over a bigger range of days while spending much less.
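A minimal sketch of that last idea, under some assumptions: the table name, dataset, and regex below are placeholders, and it uses CREATE TABLE ... AS SELECT, though the same effect can be had by setting a destination table on the query. One small table per day keeps only the funnel-relevant sessions and columns, so later queries scan megabytes instead of gigabytes:

```sql
-- Hypothetical: persist a slimmed-down copy of one day's funnel sessions.
CREATE TABLE `project.dataset.funnel_sessions_20170107` AS
SELECT
  fullVisitorId,
  visitId,
  device.deviceCategory AS platform,
  -- keep only the page hits that match a funnel URL pattern
  ARRAY(
    SELECT AS STRUCT h.hitNumber, h.time, LOWER(h.page.pagePath) AS path
    FROM UNNEST(hits) AS h
    WHERE h.type = 'PAGE'
      AND REGEXP_CONTAINS(LOWER(h.page.pagePath), r'{funnel-start-url-1}|{confirmation-url-1}')
  ) AS funnel_hits
FROM `project.dataset.ga_sessions_20170107`
WHERE totals.visits = 1
  -- drop sessions that never touched the funnel at all
  AND EXISTS (
    SELECT 1 FROM UNNEST(hits) AS h
    WHERE REGEXP_CONTAINS(LOWER(h.page.pagePath), r'{funnel-start-url-1}|{confirmation-url-1}')
  )
```

The conversion query then reads `funnel_sessions_*` instead of `ga_sessions_*`, so widening the date range no longer multiplies the cost by the full table size.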
Upvotes: 1