Moon_Watcher
Moon_Watcher

Reputation: 468

Finding distinct values with NTH_VALUE

I have a pageviews table that contains the paths of pages visited by customers within each session along with a timestamp. One of the columns is landing_page, which is populated for each row in the table (each session's pageviews will all have the same landing_page).

I'm trying to create similar columns for second_page, third_page and fourth_page, which will show the paths of the 2nd, 3rd and 4th pages visited in a session. I'm able to do this using NTH_VALUE, but I want to handle a specific case where a customer visits the same page multiple times.

For example, let's say the customer visits pages in the following order -

  1. www.dummywebsite.com/
  2. www.dummywebsite.com/products
  3. www.dummywebsite.com/products
  4. www.dummywebsite.com/products/prodA
  5. www.dummywebsite.com/cart

With my query (below), I get the second_page = "www.dummywebsite.com/products" as well as the third_page = "www.dummywebsite.com/products". What I'd like is for third_page to be "www.dummywebsite.com/products/prodA" instead.

How can I edit the below query to get the desired result?

SELECT pageview_id, session_id, user_id, created_at, landing_page, path,
       NTH_VALUE(path,2 ignore nulls) OVER(win) second_page_path,
       NTH_VALUE(path,3 ignore nulls) OVER(win) third_page_path,
       NTH_VALUE(path,4 ignore nulls) OVER(win) fourth_page_path
   FROM pageviews
WINDOW win AS (PARTITION BY user_id, session_id ORDER BY created_at ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)

Upvotes: 0

Views: 996

Answers (3)

Bobbylank
Bobbylank

Reputation: 1946

Would be easier if you'd put some sample data up but something along these lines should remove the paths that are a duplicate of previous paths

WITH CTE AS
(SELECT pageview_id, session_id, user_id, created_at, landing_page, path,
if(path=lag(path) OVER (PARTITION BY session_id, user_id ORDER BY created_at),FALSE,TRUE) distinctPath
FROM pageviews)

SELECT pageview_id, session_id, user_id, created_at, landing_page, path,
   NTH_VALUE(path,2 ignore nulls) OVER(win) second_page_path,
   NTH_VALUE(path,3 ignore nulls) OVER(win) third_page_path,
   NTH_VALUE(path,4 ignore nulls) OVER(win) fourth_page_path
FROM pageviews
WHERE distinctPath = TRUE
WINDOW win AS (PARTITION BY user_id, session_id ORDER BY created_at ROWS BETWEEN 
 UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)

Upvotes: 1

Mikhail Berlyant
Mikhail Berlyant

Reputation: 172993

Below is for BigQuery Standard SQL with minimal changes to your original query

#standardSQL
SELECT pageview_id, session_id, user_id, created_at, landing_page, path,
  NTH_VALUE(distinct_path,2 IGNORE NULLS) OVER(win) second_page_path,
  NTH_VALUE(distinct_path,3 IGNORE NULLS) OVER(win) third_page_path,
  NTH_VALUE(distinct_path,4 IGNORE NULLS) OVER(win) fourth_page_path
FROM (
  SELECT pageview_id, session_id, user_id, created_at, landing_page, path,
    IF(path = LAG(path) OVER(PARTITION BY user_id, session_id ORDER BY created_at), NULL, path) distinct_path 
  FROM `project.dataset.pageviews`
)
WINDOW win AS (PARTITION BY user_id, session_id ORDER BY created_at ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)

Upvotes: 2

Zaynul Abadin Tuhin
Zaynul Abadin Tuhin

Reputation: 31993

below query will change the third_page_path

SELECT pageview_id, session_id, user_id, created_at, landing_page, path,
           NTH_VALUE(path,2 ignore nulls) OVER(win) second_page_path,
           NTH_VALUE(path,4 ignore nulls) OVER(win) third_page_path,
           NTH_VALUE(path,4 ignore nulls) OVER(win) fourth_page_path
       FROM pageviews
    WINDOW win AS (PARTITION BY user_id, session_id ORDER BY created_at ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)

Upvotes: 0

Related Questions