Rob
Rob

Reputation: 947

How to join two tables while removing repeated entries in one column of one table

I'm not very familiar with SQL and have hit a problem with duplicate-removal that I need some help with.

I have two tables as follows:

config -> ID  -----------
       -> timestamp      |
       -> vcsID          |
       -> platform       |
                         |
data   -> ID             |
       -> configID <-----
       -> value

For each row in the config table there may be multiple rows in the data table, the join being:

data.configID = config.ID

However, in the config table there may be entries in which vcsID has not advanced (these represent builds on the same code from our version control system, and 'value' represents a build metric such as memory size); they are not duplicate rows, because the timestamp will be different, but I need to get rid of them. I want to create a view on the data that removes these duplicates. How do I go about doing that? In MySQL I can do the following:

select *
from (select * from config group by vcsID) as filtered
inner join data
on data.configID = filtered.ID

...but Postgres is strictly compliant with the SQL standard and requires that anything in the Select clause must be in the Group By clause so I can't get the ID field out of the sub-query to do the join.

As an example, some sample data might be:

config
ID    timestamp     vcsID   platform
1   1/1/2014 09:00    18       "x"
2   1/1/2014 20:20    30       "y"
3   1/1/2014 20:25    30       "y"
4   1/1/2014 20:40    31       "y"

data
ID  configID   value
12      1      40000
13      2       125
14      3       125
15      4       130

...and, taking platform "y" as the thing I'm interested in, what I'd like to get out is:

               config                 |         data
ID    timestamp     vcsID  platform   |  ID  configID value
2   1/1/2014 20:20    30      "y"     |  13      2     125
4   1/1/2014 20:40    31      "y"     |  15      4     125

or, equivalently:

               config                 |         data
ID    timestamp     vcsID  platform   |  ID  configID value
3   1/1/2014 20:25    30      "y"     |  14      3     125
4   1/1/2014 20:40    31      "y"     |  15      4     125

Any help is greatly appreciated.

Upvotes: 2

Views: 2853

Answers (2)

Erwin Brandstetter
Erwin Brandstetter

Reputation: 656301

Use DISTINCT ON:

SELECT DISTINCT ON (vcsID) *
FROM   config c
JOIN   data   d ON d.configID = c.ID
ORDER  BY vcsID, "timestamp" DESC;

Assuming you want to pick the latest row from each group of identical vcsID, thus the ORDER BY. If you really don't care which row you get for each vcsID, you don't need ORDER BY. Either way, the leading columns in ORDER BY have to match DISTINCT ON expressions, so you cannot ORDER BY c.id, like you seem to want. You'd need to wrap this in a sub-query and order in the outer query.

Detailed explanation for DISTINCT ON and alternative solutions:

Aside: don't use basic type names like timestamp as identifiers.

Upvotes: 2

NullEverything
NullEverything

Reputation: 460

Try this query:

select distinct on (vscID) * from config c join data d on d.configID = c.ID where c.platform = "y"

It should give you what you are looking for. The "distinct on (column...)" removes duplicates based on specific column(s).

Upvotes: 0

Related Questions