Reputation: 43
I'm struggling to find the proper statements to select non-duplicate entries that are duplicates only for particular columns. As an example, in the following table I only care about rows that have unique values in col1, col2, and col3 and the values in col4 and col5 do not matter. This means I would consider row 1 and row 2 to be duplicates and row 4 and row 5 to be duplicates:
col1 col2 col3 col4 col5
A 2 p 0 2
A 2 p 1 8
A 3 r 4 12
B 0 f 3 1
B 0 f 6 5
And I would want to select only the following:
col1 col2 col3 col4 col5
A 2 p 0 2
A 3 r 4 12
B 0 f 3 1
Is there a way to combine multiple DISTINCT statements to achieve this or specify certain columns to ignore when comparing rows for duplicates?
Upvotes: 4
Views: 6172
Reputation: 34774
You have to choose which lines you want to keep, you can use the ROW_NUMBER()
function for this:
SELECT col1, col2, col3, col4, col5
FROM (SELECT *, ROW_NUMBER() OVER(PARTITION BY col1, col2, col3 ORDER BY col4 DESC) 'RowRank'
FROM table
)sub
WHERE RowRank = 1
You can change the ORDER BY
section to change which row you keep and which you toss. The ROW_NUMBER()
function just assigns a number to each row, in this example, you want to preserve each combination of col1
, col2
, col3
, so you PARTITION BY
them, meaning that numbering will start at 1 for each combination of them. You can run just the inside query to get the idea.
Alternatively, you could use GROUP BY
and aggregate functions, ie:
SELECT col1, col2, col3, MAX(col4), MAX(col5)
FROM table
GROUP BY col1, col2, col3
The downside here is that the MAX()
of col4
and col5
might come from different rows, so you're not necessarily returning one single row from your original table, but if you don't care which row you return then it doesn't matter.
Upvotes: 2