user2530187
user2530187

Reputation: 43

T-SQL Eliminating duplicate rows while ignoring certain columns

I'm struggling to find the proper statements to select non-duplicate entries that are duplicates only for particular columns. As an example, in the following table I only care about rows that have unique values in col1, col2, and col3 and the values in col4 and col5 do not matter. This means I would consider row 1 and row 2 to be duplicates and row 4 and row 5 to be duplicates:

col1 col2 col3 col4 col5
 A    2    p    0    2
 A    2    p    1    8
 A    3    r    4    12
 B    0    f    3    1
 B    0    f    6    5

And I would want to select only the following:

col1 col2 col3 col4 col5
 A    2    p    0    2
 A    3    r    4    12
 B    0    f    3    1

Is there a way to combine multiple DISTINCT statements to achieve this or specify certain columns to ignore when comparing rows for duplicates?

Upvotes: 4

Views: 6172

Answers (1)

Hart CO
Hart CO

Reputation: 34774

You have to choose which lines you want to keep, you can use the ROW_NUMBER() function for this:

SELECT col1, col2, col3, col4, col5
FROM (SELECT *, ROW_NUMBER() OVER(PARTITION BY col1, col2, col3 ORDER BY col4 DESC) 'RowRank'
      FROM table
     )sub 
WHERE RowRank = 1

You can change the ORDER BY section to change which row you keep and which you toss. The ROW_NUMBER() function just assigns a number to each row, in this example, you want to preserve each combination of col1, col2, col3, so you PARTITION BY them, meaning that numbering will start at 1 for each combination of them. You can run just the inside query to get the idea.

Alternatively, you could use GROUP BY and aggregate functions, ie:

SELECT col1, col2, col3, MAX(col4), MAX(col5)
FROM table
GROUP BY col1, col2, col3

The downside here is that the MAX() of col4 and col5 might come from different rows, so you're not necessarily returning one single row from your original table, but if you don't care which row you return then it doesn't matter.

Upvotes: 2

Related Questions