Robin
Robin

Reputation: 115

BigQuery: Retrieve rows that are unique across two columns, otherwise row with largest third column

I have a BigQuery table, my_table, that looks like the following:

+---------+---------+-------+------------------+----------+--------+-----+--------+
| poll_id | user_id | count |    timestamp     | timezone | answer | age | gender |
+---------+---------+-------+------------------+----------+--------+-----+--------+
|       1 |       1 |     5 | 2019-08-06 11:00 |        1 | no     |  25 | male   |
|       1 |       1 |    10 | 2019-08-06 10:00 |        1 | no     |  25 | male   |
|       1 |       1 |    10 | 2019-08-06 10:30 |        1 | yes    |  25 | male   |
|       1 |       2 |    10 | 2019-08-06 11:00 |        1 | no     |  35 | male   |
|       1 |       2 |    20 | 2019-08-06 11:00 |        1 | no     |  35 | male   |
|       1 |       2 |    35 | 2019-08-06 11:00 |        1 | NULL   |  35 | male   |
|       2 |       1 |    10 | 2019-08-06 10:35 |        1 | no     |  25 | male   |
|       3 |       1 |    10 | 2019-08-06 10:35 |        1 | NULL   |  25 | male   |
+---------+---------+-------+------------------+----------+--------+-----+--------+

I want to retrieve rows that fulfill the following requirements:

I also want to be able to limit the search to a specific date and timezone, for example a date of 2019-08-06 and a timezone of 1, and I don't want to retrieve the rows with a value of NULL in user_id.

So far, I've tried the following standard SQL statement:

  SELECT
    t1.poll_id,
    t1.user_id,
    t1.count,
    t1.timestamp,
    t1.timezone,
    t1.answer,
    t1.age,
    t1.gender,
  FROM
    `my_table` t1
  LEFT JOIN
    `my_table` t2
  ON
    t1.poll_id = t2.poll_id
    AND t1.user_id = t2.user_id
    AND t1.count < t2.count
    AND t2.answer IS NOT NULL
    AND DATE(t2.timestamp, "+1:00") = "2019-08-06"
  WHERE
    t1.user_id IS NOT NULL
    AND t1.answer IS NOT NULL
    AND DATE(t1.timestamp, "+1:00") = "2019-08-06"
    AND t1.timezone = 1   
    AND t2.count IS NULL

The expected result for the shown table is:

+---------+---------+-------+------------------+----------+--------+-----+--------+
| poll_id | user_id | count |    timestamp     | timezone | answer | age | gender |
+---------+---------+-------+------------------+----------+--------+-----+--------+
|       1 |       1 |    10 | 2019-08-06 10:30 |        1 | yes    |  25 | male   | // count = 10 and largest timestamp
|       1 |       2 |    20 | 2019-08-06 11:00 |        1 | no     |  35 | male   | // count = 20 (the 35 row had NULL in 'answer')
|       2 |       1 |    10 | 2019-08-06 10:35 |        1 | no     |  25 | male   | // unique 'poll_id', 'user_id' combination
+---------+---------+-------+------------------+----------+--------+-----+--------+

However, it seems there are two problems:

  1. If there is more than one row with the same (largest) count value, all those rows are retrieved. This means both row 2 and 3 are retrieved in this example.
  2. If there are exactly two rows for a poll_id, user_id combination, neither is retrieved, even if they have different count values.

That is at least how it seems. I'm having a hard time tracking the problem and, of course, figuring out the correct query.

Any help would be appreciated.

Upvotes: 1

Views: 1873

Answers (2)

Mikhail Berlyant
Mikhail Berlyant

Reputation: 172994

Below is for BigQuery Standard SQL

#standardSQL
SELECT * EXCEPT(pos) 
FROM (
  SELECT *, 
    ROW_NUMBER() OVER(PARTITION BY poll_id, user_id ORDER BY count DESC, timestamp DESC) AS pos
  FROM `project.dataset.table`
  WHERE NOT answer IS NULL
  AND NOT user_id IS NULL
  AND timezone = 1
  AND SUBSTR(timestamp, 1, 10) = '2019-08-06'
)
WHERE pos = 1   

If to apply to sample data from your question - result is

Row poll_id user_id count   timestamp           timezone    answer  age gender   
1   1       1       10      2019-08-06 10:30    1           yes     25  male     
2   1       2       20      2019-08-06 11:00    1           no      35  male     
3   2       1       10      2019-08-06 10:35    1           no      25  male     

Upvotes: 2

Gordon Linoff
Gordon Linoff

Reputation: 1269733

For this type of query, row_number() is usually appropriate. I think this fits what you have described:

select t.*
from (select t.*,
             row_number() over (partition by poll_id, user_id order by count desc, timestamp desc) as seqnum
      from my_table t
      where answer is not nll
     ) t
where seqnum = 1;

Upvotes: 2

Related Questions