Reputation: 1360
I have the following table:
tbl
source type date
--- --- ---
google A 2010-02-25
google A 2013-04-11
facebook C 2008-10-22
facebook C 2007-01-28
I want to keep only a single entry of each source, and the criteria is select the source tuple with min(date) group by source
. The table consists of millions of records, and I'm looking for an efficient way to delete redundant records.
Upvotes: 1
Views: 5983
Reputation: 1567
Add an identity column to the duplicate table as a serial number that acts as a row unique identifier(auto incremental ascending order):
alter table tbl add sno int identity(1,1)
This query selects only non duplicated rows with min(date):
(select min(date),sno From tbl group by source)
So "sno" will be equals to "1" and "4".
Now join with this table, and delete the records of the join that are duplicated (t.sno is null)
delete E from tbl E
left join
(select min(date),sno From tbl group by source) T on E.sno=T.sno
where T.sno is null
Solution adapted from method 3 of this link: LINK
Upvotes: 0
Reputation: 1853
delete from t where date not in (select al.d from (select min(date) as d from t group by source )al);
Upvotes: 0
Reputation: 1242
In Microsoft SQL, you can try this.
;
WITH cte
AS (SELECT ROW_NUMBER() OVER (PARTITION BY source, type
ORDER BY createdate) RN
FROM tbsource)
DELETE FROM cte
WHERE RN > 1;
Upvotes: 0
Reputation: 1271111
In MySQL, you can do this using a join
:
delete t
from t join
(select source, min(date) as mindate
from t
group by source
) tt
on t.source = tt.source
where t.date > tt.mindate;
The only way -- off-hand -- that I can think to make this more efficient is to store the aggregation result in a subquery and add an index to it.
I can also add that regardless of the computation for determining the rows to delete, deleting lots of rows in a table in inefficient. Usually, I would recommend a three-step approach:
Upvotes: 3