infosec
infosec

Reputation: 39

Removing duplicate rows (based on values from multiple columns) from SQL Server

+--------------+------------+-------------------+------------+
| id           | DATA_ID    |   SourceID        | DateAdded  |
+--------------+------------+-------------------+------------+
|           1  | 304019032  | 1                 |  2018-01-20|
|           2  | 345583556  | 1                 |  2018-01-21|
|           3  | 299951717  | 3                 |  2018-01-23|
|           4  | 304019032  | 2                 |  2018-01-24|
|           5  | 298519282  | 2                 |  2018-01-24|
|           6  | 299951717  | 3                 |  2018-01-27|
|           7  | 345583556  | 1                 |  2018-01-27|
+--------------+------------+-------------------+------------+

I'm trying to delete duplicate rows that have the same DATA_ID and SourceID, leaving the row with the most recent date.

Upvotes: 1

Views: 97

Answers (2)

Gordon Linoff
Gordon Linoff

Reputation: 1269443

In SQL Server, I like to use row_number() and an updatable CTE for this:

with todelete as (
      select t.*, row_number() over (partition by sourceid, dataid order by dateadded desc) as seqnum
      from t
     )
delete from todelete
    where seqnum > 1;

This will work even when sourceid/dataid is NULL (which is probably not the case for you). It also does not assume that dateadded and id are mutually increasing (although that might be a reasonable assumption).

Upvotes: 2

juergen d
juergen d

Reputation: 204746

delete from your_table
where id not in 
(
  select max(id)
  from your_table
  group by data_id, sourceID
)

Upvotes: 0

Related Questions