Reputation: 2116

Scalable Solution to get latest row for each ID in BigQuery

I have a fairly large table with an ID field and a collection_time field. I want to select the latest record for each ID. Unfortunately, the combination (ID, collection_time) is not unique in my data, and I want just one of the records with the maximum collection time. I have tried two solutions, but neither has worked for me:

First, using ROW_NUMBER():

SELECT * FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY ID ORDER BY collection_time DESC) AS rn
  FROM mytable
) WHERE rn = 1

This results in a Resources exceeded error, which I guess is caused by the ORDER BY in the query.
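
To check what this query is meant to return on a small sample, here is a minimal sketch using Python's built-in sqlite3 module as a local stand-in for BigQuery. That substitution is an assumption for illustration only: it needs SQLite 3.25 or newer for window functions, and it says nothing about the Resources exceeded error. The sample rows and the `payload` column are made up. Note the DESC in the window ordering; with ascending order, rn = 1 would be the oldest row per ID.

```python
import sqlite3

# Hypothetical sample data: ID 1 has two rows tied at the latest time (20).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (ID INTEGER, collection_time INTEGER, payload TEXT)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?)",
    [(1, 10, "a"), (1, 20, "b"), (1, 20, "c"), (2, 5, "d")],
)

# Number the rows of each ID from newest to oldest and keep only the first.
query = """
SELECT ID, collection_time, payload FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY ID ORDER BY collection_time DESC) AS rn
  FROM mytable
) WHERE rn = 1
ORDER BY ID
"""
latest_rows = conn.execute(query).fetchall()
print(latest_rows)
```

Exactly one row comes back per ID. Which of the two tied rows is kept for ID 1 is not specified.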

Second, using a join between the table and the latest time per ID:

(SELECT tab1.* 
FROM mytable AS tab1
INNER JOIN EACH 
(SELECT ID, MAX(collection_time) AS second_time 
FROM mytable GROUP EACH BY ID) AS tab2
ON tab1.ID=tab2.ID AND tab1.collection_time=tab2.second_time)

This solution does not work for me because (ID, collection_time) is not unique, so the JOIN result contains multiple rows for some IDs.
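
The duplicate problem can be reproduced on a tiny sample. This is a minimal sketch using Python's sqlite3 module as a stand-in for BigQuery (an assumption for illustration). The legacy EACH modifiers are dropped because SQLite has no equivalent, and the sample rows and the `payload` column are made up.

```python
import sqlite3

# Hypothetical sample data: ID 1 has two rows tied at the latest time (20).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (ID INTEGER, collection_time INTEGER, payload TEXT)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?)",
    [(1, 10, "a"), (1, 20, "b"), (1, 20, "c"), (2, 5, "d")],
)

# Join every row to the maximum collection_time of its ID.
query = """
SELECT tab1.ID, tab1.collection_time, tab1.payload
FROM mytable AS tab1
INNER JOIN (
  SELECT ID, MAX(collection_time) AS second_time
  FROM mytable GROUP BY ID
) AS tab2
ON tab1.ID = tab2.ID AND tab1.collection_time = tab2.second_time
ORDER BY tab1.ID, tab1.payload
"""
joined_rows = conn.execute(query).fetchall()
print(joined_rows)  # [(1, 20, 'b'), (1, 20, 'c'), (2, 5, 'd')]
```

Both tied rows for ID 1 survive the join, which is the behaviour described above.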

I am wondering if there is a workaround for the resourcesExceeded error, or a different query that would work in my case?

Upvotes: 22

Answers (7)

Ruben Helsloot

Reputation: 13129

I see no one has mentioned window functions with QUALIFY:

SELECT *, MAX(collection_time) OVER (PARTITION BY id) AS max_timestamp
FROM my_table
QUALIFY collection_time = max_timestamp

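The window function adds a column max_timestamp that is accessible in the QUALIFY clause to filter on. Note that this keeps every row tied for the maximum collection_time, so an ID with duplicate (ID, collection_time) pairs still returns more than one row.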

Upvotes: 10

Sergey Geron

Reputation: 10222

Short and scalable version:

select array_agg(t order by collection_time desc limit 1)[offset(0)].*
from mytable t
group by t.id;
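
Conceptually, the aggregation above only needs to remember the single best row seen so far for each id, rather than sorting a whole partition. The Python sketch below illustrates that idea on made-up rows; it is not a description of how BigQuery actually executes the query, and the `latest_per_id` helper and `payload` field are hypothetical names.

```python
# Hypothetical sample rows: id 1 has two rows tied at the latest time (20).
rows = [
    {"id": 1, "collection_time": 10, "payload": "a"},
    {"id": 1, "collection_time": 20, "payload": "b"},
    {"id": 1, "collection_time": 20, "payload": "c"},
    {"id": 2, "collection_time": 5, "payload": "d"},
]


def latest_per_id(rows):
    """Keep one row per id: the first one seen with the largest collection_time."""
    latest = {}
    for row in rows:
        current = latest.get(row["id"])
        # Strict comparison: a later tie never replaces the row already kept.
        if current is None or row["collection_time"] > current["collection_time"]:
            latest[row["id"]] = row
    return list(latest.values())


result = latest_per_id(rows)
print(result)
```

Each group holds at most one row at any time, and ties are resolved by arrival order in this sketch. In the SQL version the choice among tied rows is not specified.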

Upvotes: 8

Mit Parekh

Reputation: 191

SELECT
  agg.table.*
FROM (
  SELECT
    id,
    ARRAY_AGG(STRUCT(table)
    ORDER BY
      collection_time DESC)[SAFE_OFFSET(0)] agg
  FROM
    `dataset.table` table
  GROUP BY
    id)

This does the job and scales well. Because the whole row is aggregated as a struct, you won't have to change the query when the schema changes.

Upvotes: 19

hsxavier

Reputation: 89

If you don't mind writing an expression for every column:

SELECT ID, 
  ARRAY_AGG(col1 ORDER BY collection_time DESC)[OFFSET(0)] AS col1,
  ARRAY_AGG(col2 ORDER BY collection_time DESC)[OFFSET(0)] AS col2
FROM myTable
GROUP BY ID

Upvotes: 3

Mosha Pasumansky

Reputation: 14014

Here is another solution, which could be more scalable because it avoids scanning the same table multiple times (which happens with both the self-join and the correlated subquery in the answers above). It only works with standard SQL (uncheck the "Use Legacy SQL" option):

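SELECT
  ID,
  -- Pick one struct with the latest collection_time; ties are broken arbitrarily.
  -- collection_time must be one of the fields packed into the STRUCT below.
  (SELECT srow
   FROM UNNEST(t.srows) srow
   ORDER BY srow.collection_time DESC
   LIMIT 1) AS latest
FROM
  (SELECT ID, ARRAY_AGG(STRUCT(col1, col2, col3, ...)) srows
   FROM id_table
   GROUP BY ID) t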

Upvotes: 0

Mikhail Berlyant

Reputation: 173171

Quick and dirty option: combine both of your queries into one. First get all records with the latest collection_time (using your second query), then dedup them using your first query:

SELECT * FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY tab1.ID) AS rn 
  FROM (
    SELECT tab1.* 
    FROM mytable AS tab1
    INNER JOIN (
      SELECT ID, MAX(collection_time) AS second_time 
      FROM mytable GROUP BY ID
    ) AS tab2
    ON tab1.ID=tab2.ID AND tab1.collection_time=tab2.second_time
  )
)
WHERE rn = 1

And with Standard SQL (proposed by S.Mohsen sh):

WITH myTable AS (
  SELECT 1 AS ID, 1 AS collection_time
),
tab1 AS (
  SELECT ID,
  MAX(collection_time) AS second_time 
  FROM myTable GROUP BY ID
),
tab2 AS (
  SELECT * FROM myTable
),
joint AS (
  SELECT tab2.* 
  FROM tab2 INNER JOIN tab1
  ON tab2.ID=tab1.ID AND tab2.collection_time=tab1.second_time 
)
SELECT * EXCEPT(rn) 
FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY ID) AS rn 
  FROM joint
)
WHERE rn=1
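
The join-then-dedup idea can be checked on a sample with ties. This is a minimal sketch using Python's sqlite3 module as a stand-in for BigQuery (an assumption for illustration; SQLite 3.25 or newer is needed for window functions). SQLite has no SELECT * EXCEPT, so the output columns are listed explicitly, and the sample rows and the `payload` column are made up.

```python
import sqlite3

# Hypothetical sample data: ID 1 has two rows tied at the latest time (20).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (ID INTEGER, collection_time INTEGER, payload TEXT)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?)",
    [(1, 10, "a"), (1, 20, "b"), (1, 20, "c"), (2, 5, "d")],
)

# Step 1 (joint): keep only rows at the latest time of their ID (may contain ties).
# Step 2: number the surviving rows per ID and keep the first one.
query = """
WITH tab1 AS (
  SELECT ID, MAX(collection_time) AS second_time
  FROM mytable GROUP BY ID
),
joint AS (
  SELECT mytable.*
  FROM mytable INNER JOIN tab1
  ON mytable.ID = tab1.ID AND mytable.collection_time = tab1.second_time
)
SELECT ID, collection_time, payload FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY ID) AS rn
  FROM joint
) WHERE rn = 1
ORDER BY ID
"""
deduped_rows = conn.execute(query).fetchall()
print(deduped_rows)
```

The window here has no ORDER BY, so it only has to number the few rows left after the join. Which tied row ends up with rn = 1 is arbitrary.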

Upvotes: 5

Pரதீப்

Reputation: 93754

As per your comment, assuming you have a table of unique IDs for which you need to find the latest collection_time, here is another way to do it using a correlated subquery. Give it a try.

SELECT id, 
       (SELECT Max(collection_time) 
        FROM   mytable B 
        WHERE  A.id = B.id) AS Max_collection_time 
FROM   id_table A
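
A small sample shows what this query returns. This is a minimal sketch using Python's sqlite3 module as a stand-in for BigQuery (an assumption for illustration); the contents of id_table and mytable are made up.

```python
import sqlite3

# Hypothetical sample data: id_table holds the unique IDs, mytable the readings.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE id_table (id INTEGER)")
conn.execute("CREATE TABLE mytable (id INTEGER, collection_time INTEGER)")
conn.executemany("INSERT INTO id_table VALUES (?)", [(1,), (2,)])
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?)",
    [(1, 10), (1, 20), (1, 20), (2, 5)],
)

# For each unique id, look up the maximum collection_time in mytable.
query = """
SELECT A.id,
       (SELECT MAX(B.collection_time)
        FROM mytable B
        WHERE A.id = B.id) AS Max_collection_time
FROM id_table A
ORDER BY A.id
"""
max_times = conn.execute(query).fetchall()
print(max_times)  # [(1, 20), (2, 5)]
```

The result has one row per ID but carries only the latest collection_time, not the other columns of the matching record.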

Upvotes: 0
