Reputation: 947
I'm not very familiar with SQL and have hit a problem with duplicate-removal that I need some help with.
I have two tables as follows:
config -> ID -----------
-> timestamp |
-> vcsID |
-> platform |
|
data -> ID |
-> configID <-----
-> value
For each row in the config table there may be multiple rows in the data table, the join being:
data.configID = config.ID
However, in the config table there may be entries in which vcsID has not advanced (these represent builds on the same code from our version control system, and 'value' represents a build metric such as memory size); they are not duplicate rows, because the timestamp will be different, but I need to get rid of them. I want to create a view on the data that removes these duplicates. How do I go about doing that? In MySQL I can do the following:
select *
from (select * from config group by vcsID) as filtered
inner join data
on data.configID = filtered.ID
...but Postgres is strictly compliant with the SQL standard and requires that anything in the Select clause must be in the Group By clause so I can't get the ID field out of the sub-query to do the join.
As an example, some sample data might be:
config
ID timestamp vcsID platform
1 1/1/2014 09:00 18 "x"
2 1/1/2014 20:20 30 "y"
3 1/1/2014 20:25 30 "y"
4 1/1/2014 20:40 31 "y"
data
ID configID value
12 1 40000
13 2 125
14 3 125
15 4 130
...and, taking platform "y" as the thing I'm interested in, what I'd like to get out is:
config | data
ID timestamp vcsID platform | ID configID value
2 1/1/2014 20:20 30 "y" | 13 2 125
4 1/1/2014 20:40 31 "y" | 15 4 125
or, equivalently:
config | data
ID timestamp vcsID platform | ID configID value
3 1/1/2014 20:25 30 "y" | 14 3 125
4 1/1/2014 20:40 31 "y" | 15 4 125
Any help is greatly appreciated.
Upvotes: 2
Views: 2853
Reputation: 656301
Use DISTINCT ON
:
SELECT DISTINCT ON (vcsID) *
FROM config c
JOIN data d ON d.configID = c.ID
ORDER BY vcsID, "timestamp" DESC;
Assuming you want to pick the latest row from each group of identical vcsID
, thus the ORDER BY
. If you really don't care which row you get for each vcsID
, you don't need ORDER BY
. Either way, the leading columns in ORDER BY
have to match DISTINCT ON
expressions, so you cannot ORDER BY c.id
, like you seem to want. You'd need to wrap this in a sub-query and order in the outer query.
Detailed explanation for DISTINCT ON
and alternative solutions:
Aside: don't use basic type names like timestamp
as identifiers.
Upvotes: 2
Reputation: 460
Try this query:
select distinct on (vscID) * from config c join data d on d.configID = c.ID where c.platform = "y"
It should give you what you are looking for. The "distinct on (column...)" removes duplicates based on specific column(s).
Upvotes: 0