taotsi

Reputation: 354

most efficient way to select duplicate rows with max timestamp

Suppose I have a table called t, which is like

id  content  time
1     'a'     100
1     'a'     101
1     'b'     102
2     'c'     200
2     'c'     201

The id column contains duplicates, and for the same id the content can also be duplicated. Now I want to select, for each id, the row with the maximum time, which would be

id  content  time
1     'b'     102
2     'c'     201

And this is my current solution:

select t1.id, t1.content, t1.time 
from (
  select id, content, time from t 
) as t1 
right join (
  select id, max(time) as time from t group by id
) as t2 
on t1.id = t2.id and t1.time = t2.time;

But this looks inefficient to me: theoretically, once select id, max(time) as time from t group by id has executed, the rows I want have already been located. The right join adds extra cost (up to O(n^2) without an index), which seems unnecessary.

So is there a more efficient way to do this, or is there something I misunderstand?
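For comparison, another common formulation uses a window function. This is a runnable sketch (not necessarily faster than the join; it depends on the planner and indexes) against an in-memory SQLite table built from the sample data above; SQLite has supported ROW_NUMBER since 3.25, and the same SQL works on Postgres:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE t (id INTEGER, content TEXT, time INTEGER);
INSERT INTO t VALUES (1,'a',100),(1,'a',101),(1,'b',102),(2,'c',200),(2,'c',201);
""")

# Rank rows within each id by time descending, then keep only the top-ranked row.
rows = con.execute("""
SELECT id, content, time FROM (
  SELECT id, content, time,
         ROW_NUMBER() OVER (PARTITION BY id ORDER BY time DESC) AS rn
  FROM t
) AS ranked
WHERE rn = 1
ORDER BY id;
""").fetchall()

print(rows)  # [(1, 'b', 102), (2, 'c', 201)]
```

This scans the table once and numbers rows per id, so no self-join is needed.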

Upvotes: 0

Views: 2161

Answers (2)

Tim Biegeleisen

Reputation: 521409

Use DISTINCT ON:

SELECT DISTINCT ON (id) id, content, time
FROM yourTable
ORDER BY id, time DESC;

On Postgres, this is usually the most performant way to write your query, and it should outperform ROW_NUMBER and other approaches.

The following index might speed up this query:

CREATE INDEX idx ON yourTable (id, time DESC, content);

This index, if used, would let Postgres rapidly find, for each id, the record having the latest time. This index also covers the content column.
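To check whether Postgres actually uses the index, one can inspect the plan with something like the following (the table and index names here follow the example above; the plan output will vary with data size and statistics):

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT DISTINCT ON (id) id, content, time
FROM yourTable
ORDER BY id, time DESC;
```

If the index is picked up, the plan should show an index scan on idx rather than a sequential scan plus sort.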

Upvotes: 1

Nayanish Damania

Reputation: 652

Try a join against the per-id maximum:

SELECT a.id, a.content, a.time
FROM t AS a
INNER JOIN (
    SELECT id, MAX(time) AS time
    FROM t
    GROUP BY id
) AS b ON a.id = b.id AND a.time = b.time;
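As a self-contained check (using Python's sqlite3 with the sample data from the question), a join against the per-id maximum returns the expected rows; note that the inner query must group by id, not content, to match the desired output:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE t (id INTEGER, content TEXT, time INTEGER);
INSERT INTO t VALUES (1,'a',100),(1,'a',101),(1,'b',102),(2,'c',200),(2,'c',201);
""")

# Join each row to its id's maximum time; only the latest row per id survives.
rows = con.execute("""
SELECT a.id, a.content, a.time
FROM t AS a
INNER JOIN (
  SELECT id, MAX(time) AS time FROM t GROUP BY id
) AS b ON a.id = b.id AND a.time = b.time
ORDER BY a.id;
""").fetchall()

print(rows)  # [(1, 'b', 102), (2, 'c', 201)]
```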

Upvotes: 0
