Removing duplicate rows with condition

Question

User sessions are tracked on the system and stored in the following format. Sometimes I get multiple records for the same session id.

Row session_id                              user_actions     
1   8a88d75c-6385-4e36-8d10-e22ac4d976a3    118,139,141  
2   8a88d75c-6385-4e36-8d10-e22ac4d976a3    118,139,141,142,143,146  
3   e85731b6-4472-40fb-ab2b-33ebd1278ba9    211,114,117,118,141,142,143,146  
4   e85731b6-4472-40fb-ab2b-33ebd1278ba9    211,114,117

I used to run a sql query with DISTINCT(session_id to keep only one of the multiple records for each session id. BUT I just realized that my query picks the row on the top even when if the bottom row recorded more actions for the same session. So you look at the following table, my query keeps Row 1 & 3, like this;

Row session_id                              user_actions     
1   8a88d75c-6385-4e36-8d10-e22ac4d976a3    118,139,141  
3   e85731b6-4472-40fb-ab2b-33ebd1278ba9    211,114,117,118,141,142,143,146

Whereas, I would like to keep row 2 and 3, like this;

Row session_id                              user_actions     
2   8a88d75c-6385-4e36-8d10-e22ac4d976a3    118,139,141,142,143,146  
3   e85731b6-4472-40fb-ab2b-33ebd1278ba9    211,114,117,118,141,142,143,146

Is there anyway to do it with a sql query? Thank you!

Mikhail Berlyant · Accepted Answer

Below is one of the option for BigQuery Standard SQL

#standardSQL
SELECT row, session_id, user_actions
FROM (
  SELECT 
    row, session_id, user_actions,
    ROW_NUMBER() OVER(PARTITION BY session_id 
      ORDER BY ARRAY_LENGTH(SPLIT(user_actions)) DESC
    ) = 1 win
  FROM `project.dataset.table`
)
WHERE win

You can test / play with above using dummy data from your question as below

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 1 row, '8a88d75c-6385-4e36-8d10-e22ac4d976a3' session_id, '118,139,141' user_actions UNION ALL
  SELECT 2, '8a88d75c-6385-4e36-8d10-e22ac4d976a3', '118,139,141,142,143,146' UNION ALL
  SELECT 3, 'e85731b6-4472-40fb-ab2b-33ebd1278ba9', '211,114,117,118,141,142,143,146' UNION ALL
  SELECT 4, 'e85731b6-4472-40fb-ab2b-33ebd1278ba9', '211,114,117' 
)
SELECT row, session_id, user_actions
FROM (
  SELECT 
    row, session_id, user_actions,
    ROW_NUMBER() OVER(PARTITION BY session_id 
      ORDER BY ARRAY_LENGTH(SPLIT(user_actions)) DESC
    ) = 1 win
  FROM `project.dataset.table`
)
WHERE win
ORDER BY row

result is

row session_id                              user_actions     
2   8a88d75c-6385-4e36-8d10-e22ac4d976a3    118,139,141,142,143,146  
3   e85731b6-4472-40fb-ab2b-33ebd1278ba9    211,114,117,118,141,142,143,146

Another option would be as below

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 1 row, '8a88d75c-6385-4e36-8d10-e22ac4d976a3' session_id, '118,139,141' user_actions UNION ALL
  SELECT 2, '8a88d75c-6385-4e36-8d10-e22ac4d976a3', '118,139,141,142,143,146' UNION ALL
  SELECT 3, 'e85731b6-4472-40fb-ab2b-33ebd1278ba9', '211,114,117,118,141,142,143,146' UNION ALL
  SELECT 4, 'e85731b6-4472-40fb-ab2b-33ebd1278ba9', '211,114,117' 
)
SELECT session_id, 
  ARRAY_AGG(user_actions ORDER BY ARRAY_LENGTH(SPLIT(user_actions)) DESC LIMIT 1)[SAFE_OFFSET(0)] user_actions
FROM `project.dataset.table`
GROUP BY session_id

This one looks a little cleaner :o)

You can extend above with for example combining distinct codes from deduplicated entries in case if (for example) some actions are missing in one row but not another and etc.)

Update:

Try below to separate expenses of calculating array_length from ordering in partition:

#standardSQL
SELECT row, session_id, user_actions
FROM (
  SELECT 
    row, session_id, user_actions, 
    ROW_NUMBER() OVER(PARTITION BY session_id ORDER BY len DESC) = 1 win
  FROM (
    SELECT *, ARRAY_LENGTH(SPLIT(user_actions)) len
    FROM `project.dataset.table`
  )
)
WHERE win

Removing duplicate rows with condition

Answers (1)

Related Questions