user2694306
user2694306

Reputation: 4050

BigQuery - Get distinct values up to a given point in time

I have a table that contains the actions of a user (e.g. view page, click button, etc). Each row contains a user_id a date (created_on) and the name of the action. I would like to create a query that for each date it would create a nested field of the distinct actions taken up to and including that date. For example, I have a table called user_actions:

-------------------------------------
| user_id |    date    |   action   |
-------------------------------------
|    1    | 2018-04-01 |    click   |
|    2    | 2018-04-01 |     view   |
|    1    | 2018-04-02 |     view   |
|    2    | 2018-04-02 |     view   |
|    2    | 2018-04-03 |      buy   |
-------------------------------------

would result in

-------------------------------------
| user_id |    date    |   actions  |
-------------------------------------
|    1    | 2018-04-01 |     click  |
|    2    | 2018-04-01 |     view   |
|    1    | 2018-04-02 |     click  |
|    2    | 2018-04-02 |     view   |
|         |            |     view   |
|    2    | 2018-04-03 |     view   |
|    2    |            |      buy   |
-------------------------------------

In the second table, actions is a nested repeated field. I know that for a single point in time I can use something similar to the following:

SELECT
    user_id,
    date,
    ARRAY(action)
FROM
    user_actions
GROUP BY
    1,2

However I am unsure how to extend this out to provide the same calculation for every date in the original table and only look at the time before the date field.

Any help would be greatly appreciated. Thanks!

Upvotes: 0

Views: 71

Answers (1)

Mikhail Berlyant
Mikhail Berlyant

Reputation: 173028

create a nested field of the distinct actions taken up to and including that date

Below is for BigQuery Standrad SQL

#standardSQL
SELECT user_id, date, 
  ARRAY(
    SELECT DISTINCT action FROM UNNEST(actions) action
  ) actions
FROM (
  SELECT user_id, date, ARRAY_AGG(action) OVER(win) actions
  FROM `project.dataset.table`
  WINDOW win AS (
    PARTITION BY user_id ORDER BY date 
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  )
)

You can test, play with above using sample data from your question as in below example

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 1 user_id, '2018-04-01' date, 'click' action UNION ALL
  SELECT 2, '2018-04-01', 'view' UNION ALL
  SELECT 1, '2018-04-02', 'view' UNION ALL
  SELECT 2, '2018-04-02', 'view' UNION ALL
  SELECT 2, '2018-04-03', 'buy' 
)
SELECT user_id, date, 
  ARRAY(
    SELECT DISTINCT action FROM UNNEST(actions) action
  ) actions
FROM (
  SELECT user_id, date, ARRAY_AGG(action) OVER(win) actions
  FROM `project.dataset.table`
  WINDOW win AS (
    PARTITION BY user_id ORDER BY date 
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  )
)
-- ORDER BY date, user_id    

with result

enter image description here

Update

Below version supports more generic case with multiple actions for same user within same day (I realized it was not a case with my initial answer)

#standardSQL
SELECT user_id, date, 
  ARRAY(
    SELECT DISTINCT action FROM UNNEST(SPLIT(actions)) action
  ) actions
FROM (
  SELECT user_id, date , STRING_AGG(actions) OVER(win) actions
  FROM (
    SELECT user_id, date, STRING_AGG(DISTINCT action) actions
    FROM `project.dataset.table`
    GROUP BY user_id, date
  )
  WINDOW win AS (
    PARTITION BY user_id ORDER BY date 
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  )
)

You can test it with below sample data (note extyra row with activity = 'play' )

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 1 user_id, DATE '2018-04-01' date, 'click' action UNION ALL
  SELECT 2, '2018-04-01', 'view' UNION ALL
  SELECT 1, '2018-04-02', 'view' UNION ALL
  SELECT 1, '2018-04-02', 'play' UNION ALL
  SELECT 2, '2018-04-02', 'view' UNION ALL
  SELECT 2, '2018-04-03', 'buy' 
)
SELECT user_id, date, 
  ARRAY(
    SELECT DISTINCT action FROM UNNEST(SPLIT(actions)) action
  ) actions
FROM (
  SELECT user_id, date , STRING_AGG(actions) OVER(win) actions
  FROM (
    SELECT user_id, date, STRING_AGG(DISTINCT action) actions
    FROM `project.dataset.table`
    GROUP BY user_id, date
  )
  WINDOW win AS (
    PARTITION BY user_id ORDER BY date 
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  )
)
-- ORDER BY date, user_id

with result

enter image description here

Upvotes: 2

Related Questions