Postgresql select duplicate column by another column

Question

I have a data set like

ID STATUS SOURCE
1  new    data1
1  old    data2
2  old    data1
2  new    data2

and I want to be able to select those duplicates ID by the STATUS column and keep the SOURCE, the final list will be:

ID STATUS SOURCE
1  new    data1
2  new    data2

I can make the list of duplicate ID with something like:

select id, status, source 
from data
where id in (
select id
from data
group by id
having (count(* ) > 1)

then I can not find a way to filter by the status to remove the duplicates.

Thanks.

Raffael · Accepted Answer

You can make use of Postgresql's DISTINCT ON feature in conjunction with an ORDER BY clause.

SELECT DISTINCT ON (id) 
       id, 
       status, 
       source
FROM   data
ORDER  BY id, status

Here is an aequivalent query that does not use any Postgres specific features:

SELECT id, 
       status, 
       source
FROM   (SELECT id, 
               status, 
               source,
               row_number() OVER (PARTITION BY id ORDER BY status) AS n 
        FROM   data) AS sub
WHERE   sub.n = 1

The ORDER BY clause feels a bit clumsy with this data set (in both query variants) because it uses alphabetical ordering in order to express the semantic ordering of "new is newer than old". The ordering would feel more natural if we used a timestamp column created_at (or similar) instead of the status column.

Postgresql select duplicate column by another column

Answers (1)

Related Questions