Select rows based on two columns in SQL Server

Question

I have a table which stores data where accidentally data has been stored multiple times because of case sensivity for the username field on server side code. The username field should be regarded as case insensitive. The important columns and data for the table can be found below.

My requirements now is to delete all but the most recent saved data. I'm writing an sql script for this, and started out by identifying all rows that are duplicates. This selection returns a table like below.

For each row, the most recent save is LASTUPDATEDDATE if it exist, otherwise CREATEDDATE. For this example, the most recent save for 'username' would be row 3.

ID CREATEDDATE LASTUPDATEDDATE USERNAME                           
-- ----------- --------------- --------
1  11-NOV-11                   USERNAME                                
2  01-NOV-11   02-NOV-11       username                            
3  8-JAN-12                    USERname

My script (which selects all rows where a duplicated username appears) looks like:


SELECT 
  id, createddate, lastupdateddate, username
FROM
  table
WHERE
  LOWER(username)
IN
  (
    SELECT 
      LOWER(username)
    FROM 
      table
    GROUP BY 
      LOWER(username)
    HAVING 
      COUNT(*) > 1
  )  
ORDER BY
  LOWER(username)

My question now is: How do I select everything but row 3? I have searched Stack Overflow for a good match to this question, but found no match good enough. I know I probably have to make a join of some kind, but can't really get my head around it. Would be really thankful for a push in the right direction.

We are using SQL Server, probably a quite new version.

Gordon Linoff · Accepted Answer

To delete duplicates, you can use:

with todelete as (
     select t.*,
            row_number() over (partition by lower(username) order by createddate desc) as seqnum
     from table
    )
delete from t
where seqnum > 1

This assigns a sequential number to each row, starting with 1 for the most recent. It then deletes all but the most recent.

For two dates, you can use:

with todelete as (
     select t.*,
            row_number() over (partition by lower(username) order by thedate desc) as seqnum
     from (select t.*,
                  (case when createddate >= coalesdce(updateddate, createddate)
                        then createddate 
                        else updateddate
                   end) as thedate
           from table
          ) t
    )
delete from t
where seqnum > 1

Select rows based on two columns in SQL Server

Answers (2)

Related Questions