Reputation: 5070
I have database which I must clean up. There are millions of records and I have duplicates (duplicates are those which have the same 2nd column, C_DataSourceID
). I need to remove all the duplicates except the newest one, according to 3rd column C_DateCreated
.
I was trying to use this:
SELECT C_EmailAddress, C_DataSourceID, C_DateCreated, count(*)
FROM duplicates GROUP BY C_DataSourceID
HAVING count(*) > 1
But it returns duplicates with the amount of the duplicates. Is there a way to get all, but one, duplicates according to C_DataSourceID
?
I don't have an ID, but I might add ID. The database is empty so far, but it is going to have many records. Should I add ID? I need those records separated not grouped by.
Upvotes: 1
Views: 83
Reputation: 26
Assuming there are no duplicates that have the same C_DateCreated:
select d.C_EmailAddress, d.C_DataSourceID, d.C_DateCreated
from duplicates as d
where d.C_DateCreated !=
(select max(d2.C_DateCreated)
from duplicates d2
where d2.C_DataSourceId = d.C_DataSourceId and
d.C_EmailAddress = d.C_EmailAddress)
Not the most performant way of doing things, but it'll work...
Upvotes: 1
Reputation: 1269443
Here is one way using ANSI SQL for solving this problem:
delete
from duplicates d
where not exists (select 1
from duplicates d2
where d2.C_DataSourceID = d.C_DataSourceID and
d2.C_DateCreated > d.C_DateCreated
)
However, not all databases support this syntax.
Upvotes: 2