Reputation: 133
I have large table (~1,000,000 rows) that may contain duplicates values.
the table contain two columns (for example col a, col b) that together represent unique key, ID and last update date.
for example I can have table like:
id | a | b | update
1 | jon | smith | 1/1
2 | don | smith | 2/5
3 | bob | david | 1/1
4 | dan | lewis | 3/1
5 | bob | david | 3/1
As you can see for id 3 and 5 the table contain the same values in both a and b columns. I would like to delete the rows that contain this kind of duplication , but keep the last updated row.
For this example I will have this table after deletion: id | a | b | update
1 | jon | smith | 1/1
2 | don | smith | 2/5
4 | dan | lewis | 3/1
5 | bob | davis | 3/1
(id = 3 deleted ,since I already have a=bob and b=davis in row where id=5 and the update in this row is higher then the one in the deleted row)
Upvotes: 2
Views: 2620
Reputation: 13544
The below one should work.
DELETE
FROM MYTABLE
WHERE ID IN(
SELECT M1.ID
FROM MYTABLE M1,
MYTABLE M2
WHERE M1.A = M2.A
AND M1.B = M2.B
AND M1.ID < M2.ID);
Upvotes: 1
Reputation: 1624
You need to do two self-references in your WHERE clause. The first identifies rows with duplicates and the second will make sure that you're not deleting the most recent version.
DELETE
FROM TestCase
WHERE EXISTS (
-- Where there's more than one
SELECT 1
FROM TestCase AS Reference
WHERE TestCase.a = Reference.a
AND TestCase.b = Reference.b
AND TestCase.[update] <> Reference.[update]
)
AND TestCase.[update] <> (
-- and this isn't the most recent
SELECT Max (Reference2.[update])
FROM TestCase AS Reference2
WHERE TestCase.a = Reference2.a
AND TestCase.b = Reference2.b
GROUP BY Reference2.a,
Reference2.b
)
Upvotes: 0
Reputation: 726987
delete from MyTable
where exists (
select 1 from MyTable t2
where MyTable.a=t2.a and MyTable.b=t2.b and MyTable.upd<t2.upd
)
Upvotes: 2