Reputation:

Remove duplicates the simplest way / explanation

First I would like to remark that I (as a Newb) did search through several Q & A regarding duplicates in a table though unfortunately for me, I couldn't manipulate the code being used as answer.

My table is made out of a report being sorted in SQL Server 2008.

I would like to know how do I remove duplicate records and with an explanation.

"MyTable":

Column1   (PK-auto incremental table's record ID) 
Column2   (some TXT) 
Column3   (Some TXT)
Column4   (SmallDateTime)
Column5   is empty

Column5 will hold the value of SUM(count of deleted duplicates including this survived row)

The key to the solution in may case is if [column2 and column3] have multiple records with same content (hence Duplicates) only they don't always share the same date (column4).

From this:

col1  col2   col3  col4         col5
----  -----  ----  -----------  ----
1     [abc]  [4]   [10/1/2012]  null
2     [abc]  [1]   [12/1/2012]  null
3     [ghi]  [6]   [4/1/2012]   null
4     [def]  [5]   [8/1/2012]   null
5     [abc]  [4]   [10/1/2012]  null
6     [def]  [5]   [12/1/2012]  null
7     [ghi]  [6]   [15/1/2012]  null
8     [abc]  [4]   [17/1/2012]  null
9     [ghi]  [6]   [6/1/2012]   null
10    [abc]  [1]   [13/1/2012]  null

Into this:

col1  col2   col3  col4         col5
----  -----  ----  -----------  ----
8     [abc]  [4]   [17/1/2012]  2
10    [abc]  [1]   [13/1/2012]  3
6     [def]  [5]   [12/1/2012]  2
7     [ghi]  [6]   [15/1/2012]  3

Meaning leave the latest (1) as a representation of every duplicated record.

++ReEditing++

Aaron Bertrand shawnt00 e2nburner... and rest of u all i can't say how much i thank your Reply although i did not yet comprehend that mass of code. i am now going to check those codes but not b4 thanking you guys !!

when i first started to program and needed sql querys, after using

Select * From MyTable

... my 1'st SQL Statement ...

i said HEY i know SQL !!! .... Now ... look at that deep knowledge of you guys ... THANKS A LOT i know that this post in StackOverFlow will be further useful for other beginners too

Upvotes: 0

Answers (3)

shawnt00

Reputation: 17925

This is a simplistic approach. You might find merge to be better. These versions preserve the highest col1 value and modify the maxdate column. Aaron's preserves the row with the maxdate. That is a distinction that I doubt is important but should be noted.

update MyTable
set col4 = (
    select max(col4)
    from MyTable as m2
    where m2.col2 = MyTable.col2 and m2.col3 = MyTable.col3
),  col5 = (
    select count(*)
    from MyTable as m2
    where m2.col2 = MyTable.col2 and m2.col3 = MyTable.col3
)
where not exists (
    select *
    from MyTable as m2
    where
        m2.col2 = MyTable.col2 and m2.col3 = MyTable.col3
        and m2.col1 > MyTable.col1
        and m2.col4 > MyTable.col4 or m2.col4 = MyTable.col4 and m2.col1 > MyTable.col1
);

delete from MyTable
where exists (
    select *
    from MyTable as m2
    where
        m2.col2 = MyTable.col2 and m2.col3 = MyTable.col3
        and m2.col1 > MyTable.col1
);

EDIT 2 Here's my shot at a merge query

merge MyTable as target
using (
    select max(col1), col2, col3, max(col4), count(*)
    from Mytable
    group by col2, col3
) as source(id, col2, col3, maxdate, rowcount)
on (
        target.col1 = source.col1
    and target.col2 = target.col2
    and target.col3 = target.col3
)
when matched then
    update set col4 = maxdate, col5 = rowcount
when not matched then delete

EDIT 3 Preserve the row with the original maxdate, breaking ties on col1

-- option #1
update MyTable
set col5 = (
    select count(*)
    from MyTable as m2
    where m2.col2 = MyTable.col2 and m2.col3 = MyTable.col3
)
where not exists (
    select *
    from MyTable as m2
    where
        m2.col2 = MyTable.col2 and m2.col3 = MyTable.col3
        and m2.col4 > MyTable.col4 or m2.col4 = MyTable.col4 and m2.col1 > MyTable.col1
);

delete from MyTable
where exists (
    select *
    from MyTable as m2
    where
        m2.col2 = MyTable.col2 and m2.col3 = MyTable.col3
        and m2.col4 > MyTable.col4 or m2.col4 = MyTable.col4 and m2.col1 > MyTable.col1
);

-- option #2
merge MyTable as target
using (
    select max(col1), col2, col3, max(col4), count(*)
    from Mytable
    group by col2, col3
) as source(maxid, col2, col3, maxdate, rowcount)
on (
        target.col2 = target.col2
    and target.col3 = target.col3
    and target.col1 = maxid
    and target.col4 = maxdate
)
when matched then
    update set col5 = rowcount
when not matched then delete

Upvotes: 1

Aaron Bertrand

Reputation: 280429

This answer uses a common table expression to apply row_number() and count() to each "slice" of data (meaning grouped by col2 + col3). The count() is used to identify how many rows belong to each such group, and the row_number() is used to apply a "rank" ordered by col4 desc (1 = latest per group, 2 = 2nd latest, etc). This also uses col1 (which looks like a unique column) to break any ties. The CTE can be followed by a query such as a select, update, delete, etc. So you can run the first select to validate that these are the rows you want to keep, and that the counts are correct. If they are, then you can proceed with the updates and deletes. You'll notice that in all cases the row_number() output is used to identify the rows you keep or the rows you discard.

To identify the rows you want to keep:

;WITH n AS 
(
  SELECT col1, col2, col3, col4, 
    c = COUNT(*) OVER (PARTITION BY col2, col3),
    rn = ROW_NUMBER() OVER 
    (
      PARTITION BY col2, col3 ORDER BY col4 DESC, col1 DESC
    )
  FROM dbo.table_name
)
SELECT col1, col2, col3, col4, c
  FROM n WHERE rn = 1;

Once you've confirmed that those are the row you want to keep, you can update them like this:

;WITH n AS 
(
  SELECT col1, col2, col3, col4, col5, 
    c = COUNT(*) OVER (PARTITION BY col2, col3),
    rn = ROW_NUMBER() OVER 
    (
      PARTITION BY col2, col3 ORDER BY col4 DESC, col1 DESC
    )
  FROM dbo.table_name
)
UPDATE n SET col5 = c
  WHERE rn = 1;

Then delete the remainders this way:

;WITH n AS 
(
  SELECT col1, col2, col3, col4, 
    rn = ROW_NUMBER() OVER 
    (
      PARTITION BY col2, col3 ORDER BY col4 DESC, col1 DESC
    )
  FROM dbo.table_name
)
DELETE n WHERE rn > 1;

Or even more simply (assuming col5 was completely null before the update):

DELETE dbo.table_name WHERE col5 IS NULL;

Upvotes: 2

Erhan A

Reputation: 701

WITH a AS (
    SELECT  *,
            ROW_NUMBER() OVER (PARTITION BY colum2 ORDER BY colum3 desc) RowNum
    FROM    mytable
)
-- deleted rows will be:

delete from mytable
where [yourID] in

(SELECT [yourID]

FROM    a
WHERE   a.RowNum <> 1 )

Upvotes: 0

Remove duplicates the simplest way / explanation

Answers (3)

Related Questions