Magnilex
Magnilex

Reputation: 11988

Select rows based on two columns in SQL Server

I have a table which stores data where accidentally data has been stored multiple times because of case sensivity for the username field on server side code. The username field should be regarded as case insensitive. The important columns and data for the table can be found below.

My requirements now is to delete all but the most recent saved data. I'm writing an sql script for this, and started out by identifying all rows that are duplicates. This selection returns a table like below.

For each row, the most recent save is LASTUPDATEDDATE if it exist, otherwise CREATEDDATE. For this example, the most recent save for 'username' would be row 3.

ID CREATEDDATE LASTUPDATEDDATE USERNAME                           
-- ----------- --------------- --------
1  11-NOV-11                   USERNAME                                
2  01-NOV-11   02-NOV-11       username                            
3  8-JAN-12                    USERname      

My script (which selects all rows where a duplicated username appears) looks like:


SELECT 
  id, createddate, lastupdateddate, username
FROM
  table
WHERE
  LOWER(username)
IN
  (
    SELECT 
      LOWER(username)
    FROM 
      table
    GROUP BY 
      LOWER(username)
    HAVING 
      COUNT(*) > 1
  )  
ORDER BY
  LOWER(username)

My question now is: How do I select everything but row 3? I have searched Stack Overflow for a good match to this question, but found no match good enough. I know I probably have to make a join of some kind, but can't really get my head around it. Would be really thankful for a push in the right direction.

We are using SQL Server, probably a quite new version.

Upvotes: 0

Views: 2325

Answers (2)

Gordon Linoff
Gordon Linoff

Reputation: 1271231

To delete duplicates, you can use:

with todelete as (
     select t.*,
            row_number() over (partition by lower(username) order by createddate desc) as seqnum
     from table
    )
delete from t
where seqnum > 1

This assigns a sequential number to each row, starting with 1 for the most recent. It then deletes all but the most recent.

For two dates, you can use:

with todelete as (
     select t.*,
            row_number() over (partition by lower(username) order by thedate desc) as seqnum
     from (select t.*,
                  (case when createddate >= coalesdce(updateddate, createddate)
                        then createddate 
                        else updateddate
                   end) as thedate
           from table
          ) t
    )
delete from t
where seqnum > 1

Upvotes: 1

sgeddes
sgeddes

Reputation: 62861

A couple of things to note -- there is no reason to use LOWER in your query. A = a in SQL Server.

Also, to get the correct date, you can use COALESCE to determine if LastUpdatedDate exists and if so, sort by it, else sort by CreatedDate.

Putting that together, this should work:

DELETE T
FROM YourTable T 
  JOIN (
     SELECT *, ROW_NUMBER() OVER (PARTITION BY username  
                   ORDER BY COALESCE(lastupdateddate, createddate) DESC) as RN
     FROM YourTable
    ) T2 ON T.Id = T2.Id
WHERE T2.RN > 1

Here is a sample fiddle: http://www.sqlfiddle.com/#!3/51f7c/1

As @Gordon correctly suggests, you could also use a CTE depending on the version of SQL Server you use (2005+):

WITH CTE AS (
     SELECT *, ROW_NUMBER() OVER (PARTITION BY username  
                               ORDER BY COALESCE(lastupdateddate, createddate) DESC) as RN
     FROM YourTable
    ) 
DELETE FROM CTE WHERE RN > 1

Upvotes: 1

Related Questions