user3909653
user3909653

Reputation: 99

Removing SQL Duplicates via SELECT

I have another quick SQL question. Consider the following table:

Value1   |   Value2   |   Value3   
------------------------------------
Peter    |   Blue     |    Red
Peter    |   Null     |    Null
Martin   |   Blue     |    Null
Martin   |   Null     |    Null
Boris    |   Null     |    Null
Sergej   |   Null     |    Green
Sergej   |   Null     |    Null

You see this is a peculiar case. There are Cases where Value2 and Value3 are both set in one entry, there are cases where either Value2 and Value3 are set and there are cases where none of them are set (which don't have a duplicate).

So the question is: How do I remove the duplicates (via Value1) and get the entries with the most information? I.e. if Value2 and/or Value3 are set, then the entries with those. Clearly I can't just use "not null" as there are cases (the ones not duplicated to begin with) I need to cover which have both on Null.

The second problem is that I only have reading access, so it has to happen within a Select statement. Thank you very much.

Upvotes: 1

Views: 145

Answers (1)

Robert Sheahan
Robert Sheahan

Reputation: 2100

OK for MS SQL this will do it

DECLARE @T TABLE (V1 VARCHAR(50), V2 VARCHAR(50), V3 VARCHAR(50))
INSERT INTO @T VALUES ('Peter', 'Blue', 'Red'), ('Peter', Null, Null), ('Martin', 'Blue',Null), 
    ('Martin', Null, Null), ('Boris', Null, Null), ('Sergej', Null, 'Green'), ('Sergej', Null, Null)
SElECT V1, V2, V3 
FROM (SELECT V1, V2, V3, ROW_NUMBER () 
          OVER (PARTITION BY V1 ORDER BY CASE WHEN V2 IS NULL THEN 1 ELSE 0 END 
                    + CASE WHEN V3 IS NULL THEN 1 ELSE 0 END) as Quality 
      FROM @T) as T
WHERE Quality = 1

RESULT

V1       V2      V3  
Boris   NULL    NULL  
Martin  Blue    NULL  
Peter   Blue    Red  
Sergej  NULL    Green  

EDIT: Note: this will give a single entry per name, even if there are multiple rows that contain the same amount of information. That is, if Peter has 2 rows that both have non-null values in V2 and V3, the system will pick one at random.

If you want all rows with the maximum amount of information, you could replace ROW_NUMBER with RANK.

This also considers V2 and V3 to be of equal weight, so a row with just V2 NULL and a row with just V3 NULL are equal. You could change that behavior by changing the CASE statements to return different values for the 2 fields. i.e. for both make NULL worth 3 and the NON-NULL for one worth 0 and the other worth 1, so NON-NULL NON-NULL is Q=1, NULL NON-NULL is Q = 3, NON-NULL NULL is Q=4, and NULL NULL is Q=6.

EDIT 2: Incorrectly had NULL NULL twice in the explanation immediately above :-(

EDIT 3: Extended explanation as requested in comments
Sure, no problem. The "ROW_NUMBER" (and the RANK) function normally generates a sequence of numbers for your dataset. To know what the order should be based on, you have to tell it. So both functions require a "OVER (ORDER BY Col1[,Col2 ...])" clause. The ORDER BY inside the OVER() works just like an ORDER BY clause at the end of a query.

In this case, I'm not using a real column in your data, I'm deriving a column (which is anonymous, I'd like to give it an alias for clarity but SQL-Server 2008 R2 doesn't support that). For reference, let's call that derived column Q even though SQL won't let us actually name it. My derived column is the sum of the 2 CASE statements, so it's an integer value that represents the number of NULLs in that row. Since the ORDER BY instruction defaults to ASCENDING order, the rows with the most data (the fewest NULLs) will have the lowest "Q" and sort to the top.

And I give the whole ROW_NUMBER function output an Alias - "Quality". It's not the same as the "Q" I described above, but it's related to it. Quality will be an integer sequence starting at 1 and increasing by 1 for each row, whereas Q will be 0, 1, or 2 depending on how many NULLs there are in the row. The lowest Q rows will get the lowest Quality numbers, but for rows with identical Q values SQL will randomly order them.

The last bit of the solution is the PARTITION BY clause, it tells the ROW_NUMBER (or RANK) function to break the data into sets (just like a GROUP BY clause in a query, and it too can take 1 or more columns) and re-start its numbering with each group. This way, Peter gets his his own 1,2,3,4,... Quality values, Martin gets his own 1,2,3,4,... values, etc.

So when I put the "WHERE Quality = 1" clause at the end of my query, I'm saying "For each person, pick the row that has the fewest NULLs"

I hope that's the question you were asking, I'm not sure I understand your "if I simply ORDER BY 1" bit.

Upvotes: 1

Related Questions