Reputation: 4050
I have a table that contains the actions of a user (e.g. view page, click button, etc). Each row contains a user_id a date (created_on) and the name of the action. I would like to create a query that for each date it would create a nested field of the distinct actions taken up to and including that date. For example, I have a table called user_actions
:
-------------------------------------
| user_id | date | action |
-------------------------------------
| 1 | 2018-04-01 | click |
| 2 | 2018-04-01 | view |
| 1 | 2018-04-02 | view |
| 2 | 2018-04-02 | view |
| 2 | 2018-04-03 | buy |
-------------------------------------
would result in
-------------------------------------
| user_id | date | actions |
-------------------------------------
| 1 | 2018-04-01 | click |
| 2 | 2018-04-01 | view |
| 1 | 2018-04-02 | click |
| 2 | 2018-04-02 | view |
| | | view |
| 2 | 2018-04-03 | view |
| 2 | | buy |
-------------------------------------
In the second table, actions is a nested repeated field. I know that for a single point in time I can use something similar to the following:
SELECT
user_id,
date,
ARRAY(action)
FROM
user_actions
GROUP BY
1,2
However I am unsure how to extend this out to provide the same calculation for every date in the original table and only look at the time before the date
field.
Any help would be greatly appreciated. Thanks!
Upvotes: 0
Views: 71
Reputation: 173028
create a nested field of the distinct actions taken up to and including that date
Below is for BigQuery Standrad SQL
#standardSQL
SELECT user_id, date,
ARRAY(
SELECT DISTINCT action FROM UNNEST(actions) action
) actions
FROM (
SELECT user_id, date, ARRAY_AGG(action) OVER(win) actions
FROM `project.dataset.table`
WINDOW win AS (
PARTITION BY user_id ORDER BY date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
)
)
You can test, play with above using sample data from your question as in below example
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 user_id, '2018-04-01' date, 'click' action UNION ALL
SELECT 2, '2018-04-01', 'view' UNION ALL
SELECT 1, '2018-04-02', 'view' UNION ALL
SELECT 2, '2018-04-02', 'view' UNION ALL
SELECT 2, '2018-04-03', 'buy'
)
SELECT user_id, date,
ARRAY(
SELECT DISTINCT action FROM UNNEST(actions) action
) actions
FROM (
SELECT user_id, date, ARRAY_AGG(action) OVER(win) actions
FROM `project.dataset.table`
WINDOW win AS (
PARTITION BY user_id ORDER BY date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
)
)
-- ORDER BY date, user_id
with result
Update
Below version supports more generic case with multiple actions for same user within same day (I realized it was not a case with my initial answer)
#standardSQL
SELECT user_id, date,
ARRAY(
SELECT DISTINCT action FROM UNNEST(SPLIT(actions)) action
) actions
FROM (
SELECT user_id, date , STRING_AGG(actions) OVER(win) actions
FROM (
SELECT user_id, date, STRING_AGG(DISTINCT action) actions
FROM `project.dataset.table`
GROUP BY user_id, date
)
WINDOW win AS (
PARTITION BY user_id ORDER BY date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
)
)
You can test it with below sample data (note extyra row with activity = 'play' )
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 user_id, DATE '2018-04-01' date, 'click' action UNION ALL
SELECT 2, '2018-04-01', 'view' UNION ALL
SELECT 1, '2018-04-02', 'view' UNION ALL
SELECT 1, '2018-04-02', 'play' UNION ALL
SELECT 2, '2018-04-02', 'view' UNION ALL
SELECT 2, '2018-04-03', 'buy'
)
SELECT user_id, date,
ARRAY(
SELECT DISTINCT action FROM UNNEST(SPLIT(actions)) action
) actions
FROM (
SELECT user_id, date , STRING_AGG(actions) OVER(win) actions
FROM (
SELECT user_id, date, STRING_AGG(DISTINCT action) actions
FROM `project.dataset.table`
GROUP BY user_id, date
)
WINDOW win AS (
PARTITION BY user_id ORDER BY date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
)
)
-- ORDER BY date, user_id
with result
Upvotes: 2