Reputation: 115
I have a BigQuery table, my_table
, that looks like the following:
+---------+---------+-------+------------------+----------+--------+-----+--------+
| poll_id | user_id | count | timestamp | timezone | answer | age | gender |
+---------+---------+-------+------------------+----------+--------+-----+--------+
| 1 | 1 | 5 | 2019-08-06 11:00 | 1 | no | 25 | male |
| 1 | 1 | 10 | 2019-08-06 10:00 | 1 | no | 25 | male |
| 1 | 1 | 10 | 2019-08-06 10:30 | 1 | yes | 25 | male |
| 1 | 2 | 10 | 2019-08-06 11:00 | 1 | no | 35 | male |
| 1 | 2 | 20 | 2019-08-06 11:00 | 1 | no | 35 | male |
| 1 | 2 | 35 | 2019-08-06 11:00 | 1 | NULL | 35 | male |
| 2 | 1 | 10 | 2019-08-06 10:35 | 1 | no | 25 | male |
| 3 | 1 | 10 | 2019-08-06 10:35 | 1 | NULL | 25 | male |
+---------+---------+-------+------------------+----------+--------+-----+--------+
I want to retrieve rows that fulfill the following requirements:
poll_id
and user_id
, include the row if it has a non-NULL value in answer
count
that is not NULL in the answer
column
count
(and non-NULL answer
), include the row with the largest timestamp
I also want to be able to limit the search to a specific date and timezone, for example a date of 2019-08-06 and a timezone of 1, and I don't want to retrieve the rows with a value of NULL in user_id
.
So far, I've tried the following standard SQL statement:
SELECT
t1.poll_id,
t1.user_id,
t1.count,
t1.timestamp,
t1.timezone,
t1.answer,
t1.age,
t1.gender,
FROM
`my_table` t1
LEFT JOIN
`my_table` t2
ON
t1.poll_id = t2.poll_id
AND t1.user_id = t2.user_id
AND t1.count < t2.count
AND t2.answer IS NOT NULL
AND DATE(t2.timestamp, "+1:00") = "2019-08-06"
WHERE
t1.user_id IS NOT NULL
AND t1.answer IS NOT NULL
AND DATE(t1.timestamp, "+1:00") = "2019-08-06"
AND t1.timezone = 1
AND t2.count IS NULL
The expected result for the shown table is:
+---------+---------+-------+------------------+----------+--------+-----+--------+
| poll_id | user_id | count | timestamp | timezone | answer | age | gender |
+---------+---------+-------+------------------+----------+--------+-----+--------+
| 1 | 1 | 10 | 2019-08-06 10:30 | 1 | yes | 25 | male | // count = 10 and largest timestamp
| 1 | 2 | 20 | 2019-08-06 11:00 | 1 | no | 35 | male | // count = 20 (the 35 row had NULL in 'answer')
| 2 | 1 | 10 | 2019-08-06 10:35 | 1 | no | 25 | male | // unique 'poll_id', 'user_id' combination
+---------+---------+-------+------------------+----------+--------+-----+--------+
However, it seems there are two problems:
count
value, all those rows are retrieved. This means both row 2 and 3 are retrieved in this example.poll_id
, user_id
combination, neither is retrieved, even if they have different count
values.That is at least how it seems. I'm having a hard time tracking the problem and, of course, figuring out the correct query.
Any help would be appreciated.
Upvotes: 1
Views: 1873
Reputation: 172994
Below is for BigQuery Standard SQL
#standardSQL
SELECT * EXCEPT(pos)
FROM (
SELECT *,
ROW_NUMBER() OVER(PARTITION BY poll_id, user_id ORDER BY count DESC, timestamp DESC) AS pos
FROM `project.dataset.table`
WHERE NOT answer IS NULL
AND NOT user_id IS NULL
AND timezone = 1
AND SUBSTR(timestamp, 1, 10) = '2019-08-06'
)
WHERE pos = 1
If to apply to sample data from your question - result is
Row poll_id user_id count timestamp timezone answer age gender
1 1 1 10 2019-08-06 10:30 1 yes 25 male
2 1 2 20 2019-08-06 11:00 1 no 35 male
3 2 1 10 2019-08-06 10:35 1 no 25 male
Upvotes: 2
Reputation: 1269733
For this type of query, row_number()
is usually appropriate. I think this fits what you have described:
select t.*
from (select t.*,
row_number() over (partition by poll_id, user_id order by count desc, timestamp desc) as seqnum
from my_table t
where answer is not nll
) t
where seqnum = 1;
Upvotes: 2