How to join two tables while removing repeated entries in one column of one table

Question

I'm not very familiar with SQL and have hit a problem with duplicate-removal that I need some help with.

I have two tables as follows:

config -> ID  -----------
       -> timestamp      |
       -> vcsID          |
       -> platform       |
                         |
data   -> ID             |
       -> configID <-----
       -> value

For each row in the config table there may be multiple rows in the data table, the join being:

data.configID = config.ID

However, in the config table there may be entries in which vcsID has not advanced (these represent builds on the same code from our version control system, and 'value' represents a build metric such as memory size); they are not duplicate rows, because the timestamp will be different, but I need to get rid of them. I want to create a view on the data that removes these duplicates. How do I go about doing that? In MySQL I can do the following:

select *
from (select * from config group by vcsID) as filtered
inner join data
on data.configID = filtered.ID

...but Postgres is strictly compliant with the SQL standard and requires that anything in the Select clause must be in the Group By clause so I can't get the ID field out of the sub-query to do the join.

As an example, some sample data might be:

config
ID    timestamp     vcsID   platform
1   1/1/2014 09:00    18       "x"
2   1/1/2014 20:20    30       "y"
3   1/1/2014 20:25    30       "y"
4   1/1/2014 20:40    31       "y"

data
ID  configID   value
12      1      40000
13      2       125
14      3       125
15      4       130

...and, taking platform "y" as the thing I'm interested in, what I'd like to get out is:

               config                 |         data
ID    timestamp     vcsID  platform   |  ID  configID value
2   1/1/2014 20:20    30      "y"     |  13      2     125
4   1/1/2014 20:40    31      "y"     |  15      4     125

or, equivalently:

               config                 |         data
ID    timestamp     vcsID  platform   |  ID  configID value
3   1/1/2014 20:25    30      "y"     |  14      3     125
4   1/1/2014 20:40    31      "y"     |  15      4     125

Any help is greatly appreciated.

Erwin Brandstetter · Accepted Answer

Use DISTINCT ON:

SELECT DISTINCT ON (vcsID) *
FROM   config c
JOIN   data   d ON d.configID = c.ID
ORDER  BY vcsID, "timestamp" DESC;

Assuming you want to pick the latest row from each group of identical vcsID, thus the ORDER BY. If you really don't care which row you get for each vcsID, you don't need ORDER BY. Either way, the leading columns in ORDER BY have to match DISTINCT ON expressions, so you cannot ORDER BY c.id, like you seem to want. You'd need to wrap this in a sub-query and order in the outer query.

Detailed explanation for DISTINCT ON and alternative solutions:

Select first row in each GROUP BY group?

Aside: don't use basic type names like timestamp as identifiers.

How to join two tables while removing repeated entries in one column of one table

Answers (2)

Related Questions