SL8t7
SL8t7

Reputation: 647

BigQuery - Get most recent data for each individual user

I wonder if anyone here can help with a BigQuery piece I am working on.

This will need to pull the most recent gplus/currents activity for each individual user in the domain. I have tried the following query, but this pulls every activity for every user:

SELECT
  TIMESTAMP_MICROS(time_usec) AS date,
  email,
  event_type,
  event_name
FROM
  `bqadminreporting.adminlogtracking.activity`
WHERE
  record_type LIKE 'gplus'
ORDER BY
  email ASC;

I have tried to use DISTINCT, but I still get multiple entries for the same user. Ideally, I need to do this looking back over 90 day... (So between today and 90 days ago, get the most recent activity for each user - if that makes sense?) which brings me to the issue with another question.

EDIT: Example data and expected output.

Fields: There are over 500 fields, I have just listed the relevant ones

+--------------------------------+---------+----------+
|           Field name           |  Type   |   Mode   |
+--------------------------------+---------+----------+
| time_usec                      | INTEGER | NULLABLE |
| email                          | STRING  | NULLABLE |
| event_type                     | STRING  | NULLABLE |
| event_name                     | STRING  | NULLABLE |
| record_type                    | STRING  | NULLABLE |
| gplus                          | RECORD  | NULLABLE |
| gplus. log_event_resource_name | STRING  | NULLABLE |
| gplus. attachment_type         | STRING  | NULLABLE |
| gplus. plusone_context         | STRING  | NULLABLE |
| gplus. post_permalink          | STRING  | NULLABLE |
| gplus. post_resource_name      | STRING  | NULLABLE |
| gplus. comment_resource_name   | STRING  | NULLABLE |
| gplus. post_visibility         | STRING  | NULLABLE |
| gplus. user_type               | STRING  | NULLABLE |
| gplus. post_author_name        | STRING  | NULLABLE |
+--------------------------------+---------+----------+

Output from my query: This is the output I get when running my query above.

+-----+--------------------------------+------------------+----------------+----------------+
| Row |              date              |      email       |   event_type   |   event_name   |
+-----+--------------------------------+------------------+----------------+----------------+
|   1 | 2020-01-30 07:10:19.088 UTC    | [email protected] | post_change    | create_post    |
|   2 | 2020-03-03 08:47:25.086485 UTC | [email protected] | coment_change  | create_comment |
|   3 | 2020-03-23 09:10:09.522 UTC    | [email protected] | post_change    | create_post    |
|   4 | 2020-03-23 09:49:00.337 UTC    | [email protected] | plusone_change | remove_plusone |
|   5 | 2020-03-23 09:48:10.461 UTC    | [email protected] | plusone_change | add_plusone    |
|   6 | 2020-01-30 10:04:29.757005 UTC | [email protected] | coment_change  | create_comment |
|   7 | 2020-03-28 08:52:50.711359 UTC | [email protected] | coment_change  | create_comment |
|   8 | 2020-11-08 10:08:09.161325 UTC | [email protected] | coment_change  | create_comment |
|   9 | 2020-04-21 15:28:10.022683 UTC | [email protected] | coment_change  | create_comment |
|  10 | 2020-03-28 09:37:28.738863 UTC | [email protected] | coment_change  | create_comment |
+-----+--------------------------------+------------------+----------------+----------------+

Desired result: Only 1 row of data per user, showing only the most recent event.

+-----+--------------------------------+------------------+----------------+----------------+
| Row |              date              |      email       |   event_type   |   event_name   |
+-----+--------------------------------+------------------+----------------+----------------+
|   1 | 2020-03-23 09:49:00.337 UTC    | [email protected] | plusone_change | remove_plusone |
|   2 | 2020-11-08 10:08:09.161325 UTC | [email protected] | coment_change  | create_comment |
|   3 | 2020-04-21 15:28:10.022683 UTC | [email protected] | coment_change  | create_comment |
|   4 | 2020-03-28 09:37:28.738863 UTC | [email protected] | coment_change  | create_comment |
+-----+--------------------------------+------------------+----------------+----------------+

Upvotes: 1

Views: 3800

Answers (3)

Sergey Geron
Sergey Geron

Reputation: 10162

Use array_agg:

select 
  email,
  array_agg(STRUCT(TIMESTAMP_MICROS(time_usec) as date, event_type, event_name) ORDER BY time_usec desc LIMIT 1)[OFFSET(0)].*
from `bqadminreporting.adminlogtracking.activity`
where
  record_type LIKE 'gplus'
  and time_usec > unix_micros(timestamp_sub(current_timestamp(), interval 90 day))
group by email
order by email

Test example:

with mytable as (
  select timestamp '2020-01-30 07:10:19.088 UTC' as date, '[email protected]' as email, 'post_change' as event_type, 'create_post' as event_name union all
  select timestamp '2020-03-03 08:47:25.086485 UTC', '[email protected]', 'coment_change', 'create_comment' union all
  select timestamp '2020-03-23 09:10:09.522 UTC', '[email protected]', 'post_change', 'create_post' union all
  select timestamp '2020-03-23 09:49:00.337 UTC', '[email protected]', 'plusone_change', 'remove_plusone' union all
  select timestamp '2020-03-23 09:48:10.461 UTC', '[email protected]', 'plusone_change', 'add_plusone' union all
  select timestamp '2020-01-30 10:04:29.757005 UTC', '[email protected]', 'coment_change', 'create_coment' union all
  select timestamp '2020-03-28 08:52:50.711359 UTC', '[email protected]', 'coment_change', 'create_coment' union all
  select timestamp '2020-11-08 10:08:09.161325 UTC', '[email protected]', 'coment_change', 'create_coment' union all
  select timestamp '2020-04-21 15:28:10.022683 UTC', '[email protected]', 'coment_change', 'create_coment' union all
  select timestamp '2020-03-28 09:37:28.738863 UTC', '[email protected]', 'coment_change', 'create_coment'
)
select 
  email,
  array_agg(STRUCT(date, event_type, event_name) ORDER BY date desc LIMIT 1)[OFFSET(0)].*
from mytable
group by email

enter image description here

Upvotes: 1

Gordon Linoff
Gordon Linoff

Reputation: 1269693

If you want all columns from the most recent row, you can use this BigQuery syntax:

select array_agg(t order by date desc limit 1)[ordinal(1)].*
from mytable t
group by t.email;

If you want specific columns, then Sergey's solution might be simpler.

Upvotes: 1

Vibhor Gupta
Vibhor Gupta

Reputation: 699

An alternative way to solve your problem is :-

select * from (
select 
max (date1) max_dt
from  mytable
group by date(date1)), mytable
where date1=max_dt

Upvotes: 0

Related Questions