Removing duplicate rows (based on values from multiple columns) from SQL Server

Question

+--------------+------------+-------------------+------------+
| id           | DATA_ID    |   SourceID        | DateAdded  |
+--------------+------------+-------------------+------------+
|           1  | 304019032  | 1                 |  2018-01-20|
|           2  | 345583556  | 1                 |  2018-01-21|
|           3  | 299951717  | 3                 |  2018-01-23|
|           4  | 304019032  | 2                 |  2018-01-24|
|           5  | 298519282  | 2                 |  2018-01-24|
|           6  | 299951717  | 3                 |  2018-01-27|
|           7  | 345583556  | 1                 |  2018-01-27|
+--------------+------------+-------------------+------------+

I'm trying to delete duplicate rows that have the same DATA_ID and SourceID, leaving the row with the most recent date.

Gordon Linoff · Accepted Answer

In SQL Server, I like to use row_number() and an updatable CTE for this:

with todelete as (
      select t.*, row_number() over (partition by sourceid, dataid order by dateadded desc) as seqnum
      from t
     )
delete from todelete
    where seqnum > 1;

This will work even when sourceid/dataid is NULL (which is probably not the case for you). It also does not assume that dateadded and id are mutually increasing (although that might be a reasonable assumption).

Removing duplicate rows (based on values from multiple columns) from SQL Server

Answers (2)

Related Questions