Reputation: 1
I am extremely new to BigQuery and I am working on a project to identify trends by hour for users navigating from the home page to every page in our site based on GA data, I have started to piece together code for how to get from the home page to a page called "/wait-times", how can I do this for home to every other page combination on our site?
hourly_wait_times = pd.io.gbq.read_gbq('''
select
visitStartTime_HOUR,
visitStartTime_DAY,
sum(waitTimes) waitTimes, #
sum(allPageViews) allPageViews #
from (
SELECT
date,
EXTRACT(HOUR FROM
DATETIME(TIMESTAMP_SECONDS(
visitStartTime),'America/New_York')) visitStartTime_HOUR,
EXTRACT(DATE FROM
DATETIME(TIMESTAMP_SECONDS(
visitStartTime),'America/New_York')) visitStartTime_DAY,
case when h.page.pagePath = "/wait-times" then 1 else 0 end waitTimes,
1 allPageViews,
h.page.hostname,
h.page.pagePath,
LAG(h.page.pagePath, 1)
OVER (PARTITION BY fullVisitorId, visitId, DATE ORDER BY h.hitNumber ASC) AS PreviousPage,
LAG(h.page.pagePath, 1)
OVER (PARTITION BY fullVisitorId, visitId, DATE ORDER BY h.hitNumber DESC) AS NextPage
FROM `MY-GA-SESSION`.`ga_sessions_*`, unnest(hits) h
WHERE
_TABLE_SUFFIX BETWEEN FORMAT_DATE('%Y%m%d',DATE_SUB(CURRENT_DATE(), INTERVAL 3 MONTH))
AND FORMAT_DATE('%Y%m%d',DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
and h.type = 'PAGE'
and h.page.hostname = 'www.myhostname'
) A
where PreviousPage = '/'
group by visitStartTime_DAY, visitStartTime_HOUR
order by visitStartTime_DAY, visitStartTime_HOUR
''',
project_id=project_id)
I would like the table to look like this, how can this be done?
+-------+------------+----------+------+-------+-----------+--+
| PageA | PageB | date | hour | count | pageviews | |
+-------+------------+----------+------+-------+-----------+--+
| / | /donations | 12/1/19 | 0 | 0 | 100 | |
| / | | | 1 | 4 | 254 | |
| / | | | ... | | | |
| | | | 23 | 5 | 500 | |
| | | ... | | | | |
| / | | 3/1/19 | 0 | 1 | 183 | |
| | | | ... | | | |
| / | | | | | | |
| / | | | 23 | 6 | 600 | |
| / | /store | 12/1/19 | 0 | 2 | 100 | |
| | | | ... | | | |
+-------+------------+----------+------+-------+-----------+--+
Upvotes: 0
Views: 208
Reputation: 3628
I would probably structure it a bit like this.
with pages as (
select date, visitID, visitStartTime, h.page.pagePath, h.hitNumber
from `MY-GA-SESSION`.`ga_sessions_*`, unnest(hits) h
where h.type = 'PAGE'
and _TABLE_SUFFIX BETWEEN FORMAT_DATE('%Y%m%d',DATE_SUB(CURRENT_DATE(), INTERVAL 3 MONTH))
and FORMAT_DATE('%Y%m%d',DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
),
temp as (
select *,
lead(pagePath,1) over (partition by visitID order by hitNumber asc) as nextPagePath, -- Get the next page path
row_number() over (partition by visitID order by hitNumber asc) as pagenum -- will need later on in we want to filter to landing page only
from pages
),
temp2 as (
select
pagePath as pageA,
nextPagePath as pageB,
EXTRACT(DATE FROM DATETIME(TIMESTAMP_SECONDS(visitStartTime),'America/New_York')) as date,
EXTRACT(HOUR FROM DATETIME(TIMESTAMP_SECONDS(visitStartTime),'America/New_York')) as hour,
count(*) as a_to_b_count
from temp
where pagePath = '/' --whatever your homepage (or pageA is)
and pagenum = 1 -- if you want pageA as a landing page only, delete if you want all pageA->pageB visits
)
select pageA, pageB, array_agg(struct(date,hour,a_to_b_count) order by 1 asc, 2 asc) as visit_info
from temp2
group by 1,2
Note: I'm not exactly sure what you mean by pageviews in this context, my query focuses on the pageA->pageB transition. If you are looking for total pageviews associated after each A->B transition, I would wrap that in a CTE and then join it back in by visitID
before aggregating.
Upvotes: 1