Reputation: 133

Remove duplicated rows from large table

I have large table (~1,000,000 rows) that may contain duplicates values.

the table contain two columns (for example col a, col b) that together represent unique key, ID and last update date.

for example I can have table like:

id | a | b | update

1 | jon | smith | 1/1

2 | don | smith | 2/5

3 | bob | david | 1/1

4 | dan | lewis | 3/1

5 | bob | david | 3/1

As you can see for id 3 and 5 the table contain the same values in both a and b columns. I would like to delete the rows that contain this kind of duplication , but keep the last updated row.

For this example I will have this table after deletion: id | a | b | update

1 | jon | smith | 1/1

2 | don | smith | 2/5

4 | dan | lewis | 3/1

5 | bob | davis | 3/1

(id = 3 deleted ,since I already have a=bob and b=davis in row where id=5 and the update in this row is higher then the one in the deleted row)

Upvotes: 2

Answers (3)

Teja

Reputation: 13544

The below one should work.

DELETE FROM MYTABLE WHERE ID IN( SELECT M1.ID FROM MYTABLE M1, MYTABLE M2 WHERE M1.A = M2.A AND M1.B = M2.B AND M1.ID < M2.ID);

Upvotes: 1

Jason 'Bug' Fenter

Reputation: 1624

You need to do two self-references in your WHERE clause. The first identifies rows with duplicates and the second will make sure that you're not deleting the most recent version.

DELETE
FROM     TestCase
WHERE    EXISTS (
   -- Where there's more than one
   SELECT   1
   FROM     TestCase AS Reference
   WHERE    TestCase.a = Reference.a
        AND TestCase.b = Reference.b
        AND TestCase.[update] <> Reference.[update]
   )
     AND TestCase.[update] <> (
   -- and this isn't the most recent
   SELECT   Max (Reference2.[update])
   FROM     TestCase AS Reference2
   WHERE    TestCase.a = Reference2.a
        AND TestCase.b = Reference2.b
   GROUP BY Reference2.a,
            Reference2.b
   )

Upvotes: 0

Sergey Kalinichenko

Reputation: 726987

delete from MyTable
where exists (
    select 1 from MyTable t2
    where MyTable.a=t2.a and MyTable.b=t2.b and MyTable.upd<t2.upd
)

Upvotes: 2

Remove duplicated rows from large table

Answers (3)

Related Questions