Reputation: 179
User sessions are tracked on the system and stored in the following format. Sometimes I get multiple records for the same session id.
Row session_id user_actions
1 8a88d75c-6385-4e36-8d10-e22ac4d976a3 118,139,141
2 8a88d75c-6385-4e36-8d10-e22ac4d976a3 118,139,141,142,143,146
3 e85731b6-4472-40fb-ab2b-33ebd1278ba9 211,114,117,118,141,142,143,146
4 e85731b6-4472-40fb-ab2b-33ebd1278ba9 211,114,117
I used to run a sql query with DISTINCT(session_id
to keep only one of the multiple records for each session id. BUT I just realized that my query picks the row on the top even when if the bottom row recorded more actions for the same session. So you look at the following table, my query keeps Row 1 & 3, like this;
Row session_id user_actions
1 8a88d75c-6385-4e36-8d10-e22ac4d976a3 118,139,141
3 e85731b6-4472-40fb-ab2b-33ebd1278ba9 211,114,117,118,141,142,143,146
Whereas, I would like to keep row 2 and 3, like this;
Row session_id user_actions
2 8a88d75c-6385-4e36-8d10-e22ac4d976a3 118,139,141,142,143,146
3 e85731b6-4472-40fb-ab2b-33ebd1278ba9 211,114,117,118,141,142,143,146
Is there anyway to do it with a sql query? Thank you!
Upvotes: 1
Views: 352
Reputation: 172993
Below is one of the option for BigQuery Standard SQL
#standardSQL
SELECT row, session_id, user_actions
FROM (
SELECT
row, session_id, user_actions,
ROW_NUMBER() OVER(PARTITION BY session_id
ORDER BY ARRAY_LENGTH(SPLIT(user_actions)) DESC
) = 1 win
FROM `project.dataset.table`
)
WHERE win
You can test / play with above using dummy data from your question as below
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 row, '8a88d75c-6385-4e36-8d10-e22ac4d976a3' session_id, '118,139,141' user_actions UNION ALL
SELECT 2, '8a88d75c-6385-4e36-8d10-e22ac4d976a3', '118,139,141,142,143,146' UNION ALL
SELECT 3, 'e85731b6-4472-40fb-ab2b-33ebd1278ba9', '211,114,117,118,141,142,143,146' UNION ALL
SELECT 4, 'e85731b6-4472-40fb-ab2b-33ebd1278ba9', '211,114,117'
)
SELECT row, session_id, user_actions
FROM (
SELECT
row, session_id, user_actions,
ROW_NUMBER() OVER(PARTITION BY session_id
ORDER BY ARRAY_LENGTH(SPLIT(user_actions)) DESC
) = 1 win
FROM `project.dataset.table`
)
WHERE win
ORDER BY row
result is
row session_id user_actions
2 8a88d75c-6385-4e36-8d10-e22ac4d976a3 118,139,141,142,143,146
3 e85731b6-4472-40fb-ab2b-33ebd1278ba9 211,114,117,118,141,142,143,146
Another option would be as below
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 row, '8a88d75c-6385-4e36-8d10-e22ac4d976a3' session_id, '118,139,141' user_actions UNION ALL
SELECT 2, '8a88d75c-6385-4e36-8d10-e22ac4d976a3', '118,139,141,142,143,146' UNION ALL
SELECT 3, 'e85731b6-4472-40fb-ab2b-33ebd1278ba9', '211,114,117,118,141,142,143,146' UNION ALL
SELECT 4, 'e85731b6-4472-40fb-ab2b-33ebd1278ba9', '211,114,117'
)
SELECT session_id,
ARRAY_AGG(user_actions ORDER BY ARRAY_LENGTH(SPLIT(user_actions)) DESC LIMIT 1)[SAFE_OFFSET(0)] user_actions
FROM `project.dataset.table`
GROUP BY session_id
This one looks a little cleaner :o)
You can extend above with for example combining distinct codes from deduplicated entries in case if (for example) some actions are missing in one row but not another and etc.)
Update:
Try below to separate expenses of calculating array_length from ordering in partition:
#standardSQL
SELECT row, session_id, user_actions
FROM (
SELECT
row, session_id, user_actions,
ROW_NUMBER() OVER(PARTITION BY session_id ORDER BY len DESC) = 1 win
FROM (
SELECT *, ARRAY_LENGTH(SPLIT(user_actions)) len
FROM `project.dataset.table`
)
)
WHERE win
Upvotes: 2