Reputation: 2116
I have a quite large table with a field ID
and another field as collection_time
. I want to select latest record for each ID. Unfortunately combination of (ID, collection_time)
time is not unique together in my data. I want just one of records with the maximum collection time
. I have tried two solutions but none of them has worked for me:
First: using query
SELECT * FROM
(SELECT *, ROW_NUMBER() OVER (PARTITION BY ID ORDER BY collection_time) as rn
FROM mytable) where rn=1
This results in Resources exceeded
error that I guess is because of ORDER BY
in the query.
Second Using join between table and latest time:
(SELECT tab1.*
FROM mytable AS tab1
INNER JOIN EACH
(SELECT ID, MAX(collection_time) AS second_time
FROM mytable GROUP EACH BY ID) AS tab2
ON tab1.ID=tab2.ID AND tab1.collection_time=tab2.second_time)
this solution does not work for me because (ID, collection_time)
are not unique together so in JOIN
result there would be multiple rows for each ID
.
I am wondering if there is a workaround for the resourcesExceeded error, or a different query that would work in my case?
Upvotes: 22
Views: 43137
Reputation: 13129
I see no one has mentioned window functions with QUALIFY
:
SELECT *, MAX(collection_time) OVER (PARTITION BY id) AS max_timestamp
FROM my_table
QUALIFY collection_time = max_timestamp
The window function adds a column max_timestamp
that is accessible in the QUALIFY
clause to filter on.
Upvotes: 10
Reputation: 10222
Short and scalable version:
select array_agg(t order by collection_time desc limit 1)[offset(0)].*
from mytable t
group by t.id;
Upvotes: 8
Reputation: 191
SELECT
agg.table.*
FROM (
SELECT
id,
ARRAY_AGG(STRUCT(table)
ORDER BY
collection_time DESC)[SAFE_OFFSET(0)] agg
FROM
`dataset.table` table
GROUP BY
id)
This will do the job for you and is scalable considering the fact that the schema keeps changing, you won't have to change this
Upvotes: 19
Reputation: 89
If you don't care about writing a piece of code for every column:
SELECT ID,
ARRAY_AGG(col1 ORDER BY collection_time DESC)[OFFSET(0)] AS col1,
ARRAY_AGG(col2 ORDER BY collection_time DESC)[OFFSET(0)] AS col2
FROM myTable
GROUP BY ID
Upvotes: 3
Reputation: 14014
Another solution, which could be more scalable since it avoids multiple scans of the same table (which will happen with both self-join and correlated subquery in above answers). This solution only works with standard SQL (uncheck "Use Legacy SQL" option):
SELECT
ID,
(SELECT srow.*
FROM UNNEST(t.srows) srow
WHERE srow.collection_time = MAX(srow.collection_time))
FROM
(SELECT ID, ARRAY_AGG(STRUCT(col1, col2, col3, ...)) srows
FROM id_table
GROUP BY ID) t
Upvotes: 0
Reputation: 173171
Quick and dirty option - combine your both queries into one - first get all records with latest collection_time (using your second query) and then dedup them using your first query:
SELECT * FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY tab1.ID) AS rn
FROM (
SELECT tab1.*
FROM mytable AS tab1
INNER JOIN (
SELECT ID, MAX(collection_time) AS second_time
FROM mytable GROUP BY ID
) AS tab2
ON tab1.ID=tab2.ID AND tab1.collection_time=tab2.second_time
)
)
WHERE rn = 1
And with Standard SQL (proposed by S.Mohsen sh)
WITH myTable AS (
SELECT 1 AS ID, 1 AS collection_time
),
tab1 AS (
SELECT ID,
MAX(collection_time) AS second_time
FROM myTable GROUP BY ID
),
tab2 AS (
SELECT * FROM myTable
),
joint AS (
SELECT tab2.*
FROM tab2 INNER JOIN tab1
ON tab2.ID=tab1.ID AND tab2.collection_time=tab1.second_time
)
SELECT * EXCEPT(rn)
FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY ID) AS rn
FROM joint
)
WHERE rn=1
Upvotes: 5
Reputation: 93754
As per your comment, Considering you have a table with unique ID's
for which you need to find latest collection_time
. Here is another way to do it using Correlated Sub-Query
. Give it a try.
SELECT id,
(SELECT Max(collection_time)
FROM mytable B
WHERE A.id = B.id) AS Max_collection_time
FROM id_table A
Upvotes: 0