Reputation: 468
I have a pageviews
table that contains the paths of pages visited by customers within each session along with a timestamp. One of the columns is landing_page
, which is populated for each row in the table (each session's pageviews will all have the same landing_page
).
I'm trying to create similar columns for second_page
, third_page
and fourth_page
, which will show the paths of the 2nd, 3rd and 4th pages visited in a session. I'm able to do this using NTH_VALUE, but I want to handle a specific case where a customer visits the same page multiple times.
For example, let's say the customer visits pages in the following order -
With my query (below), I get the second_page
= "www.dummywebsite.com/products" as well as the third_page
= "www.dummywebsite.com/products". What I'd like is for third_page
to be "www.dummywebsite.com/products/prodA" instead.
How can I edit the below query to get the desired result?
SELECT pageview_id, session_id, user_id, created_at, landing_page, path,
NTH_VALUE(path,2 ignore nulls) OVER(win) second_page_path,
NTH_VALUE(path,3 ignore nulls) OVER(win) third_page_path,
NTH_VALUE(path,4 ignore nulls) OVER(win) fourth_page_path
FROM pageviews
WINDOW win AS (PARTITION BY user_id, session_id ORDER BY created_at ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
Upvotes: 0
Views: 996
Reputation: 1946
Would be easier if you'd put some sample data up but something along these lines should remove the paths that are a duplicate of previous paths
WITH CTE AS
(SELECT pageview_id, session_id, user_id, created_at, landing_page, path,
if(path=lag(path) OVER (PARTITION BY session_id, user_id ORDER BY created_at),FALSE,TRUE) distinctPath
FROM pageviews)
SELECT pageview_id, session_id, user_id, created_at, landing_page, path,
NTH_VALUE(path,2 ignore nulls) OVER(win) second_page_path,
NTH_VALUE(path,3 ignore nulls) OVER(win) third_page_path,
NTH_VALUE(path,4 ignore nulls) OVER(win) fourth_page_path
FROM pageviews
WHERE distinctPath = TRUE
WINDOW win AS (PARTITION BY user_id, session_id ORDER BY created_at ROWS BETWEEN
UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
Upvotes: 1
Reputation: 172993
Below is for BigQuery Standard SQL with minimal changes to your original query
#standardSQL
SELECT pageview_id, session_id, user_id, created_at, landing_page, path,
NTH_VALUE(distinct_path,2 IGNORE NULLS) OVER(win) second_page_path,
NTH_VALUE(distinct_path,3 IGNORE NULLS) OVER(win) third_page_path,
NTH_VALUE(distinct_path,4 IGNORE NULLS) OVER(win) fourth_page_path
FROM (
SELECT pageview_id, session_id, user_id, created_at, landing_page, path,
IF(path = LAG(path) OVER(PARTITION BY user_id, session_id ORDER BY created_at), NULL, path) distinct_path
FROM `project.dataset.pageviews`
)
WINDOW win AS (PARTITION BY user_id, session_id ORDER BY created_at ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
Upvotes: 2
Reputation: 31993
below query will change the third_page_path
SELECT pageview_id, session_id, user_id, created_at, landing_page, path,
NTH_VALUE(path,2 ignore nulls) OVER(win) second_page_path,
NTH_VALUE(path,4 ignore nulls) OVER(win) third_page_path,
NTH_VALUE(path,4 ignore nulls) OVER(win) fourth_page_path
FROM pageviews
WINDOW win AS (PARTITION BY user_id, session_id ORDER BY created_at ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
Upvotes: 0