Reputation: 25
I have a data set like
ID STATUS SOURCE
1 new data1
1 old data2
2 old data1
2 new data2
and I want to be able to select those duplicates ID by the STATUS column and keep the SOURCE, the final list will be:
ID STATUS SOURCE
1 new data1
2 new data2
I can make the list of duplicate ID with something like:
select id, status, source
from data
where id in (
select id
from data
group by id
having (count(* ) > 1)
then I can not find a way to filter by the status to remove the duplicates.
Thanks.
Upvotes: 0
Views: 908
Reputation: 2669
You can make use of Postgresql's DISTINCT ON
feature in conjunction with an ORDER BY
clause.
SELECT DISTINCT ON (id)
id,
status,
source
FROM data
ORDER BY id, status
Here is an aequivalent query that does not use any Postgres specific features:
SELECT id,
status,
source
FROM (SELECT id,
status,
source,
row_number() OVER (PARTITION BY id ORDER BY status) AS n
FROM data) AS sub
WHERE sub.n = 1
The ORDER BY
clause feels a bit clumsy with this data set (in both query variants) because it uses alphabetical ordering in order to express the semantic ordering of "new
is newer than old
". The ordering would feel more natural if we used a timestamp column created_at
(or similar) instead of the status
column.
Upvotes: 1