Reputation:
First I would like to remark that I (as a Newb) did search through several Q & A regarding duplicates in a table though unfortunately for me, I couldn't manipulate the code being used as answer.
My table is made out of a report being sorted in SQL Server 2008.
I would like to know how do I remove duplicate records and with an explanation.
"MyTable":
Column1 (PK-auto incremental table's record ID)
Column2 (some TXT)
Column3 (Some TXT)
Column4 (SmallDateTime)
Column5 is empty
Column5 will hold the value of SUM(count of deleted duplicates including this survived row)
The key to the solution in may case is if [column2 and column3]
have multiple records with same content (hence Duplicates) only they don't always share the same date (column4
).
From this:
col1 col2 col3 col4 col5
---- ----- ---- ----------- ----
1 [abc] [4] [10/1/2012] null
2 [abc] [1] [12/1/2012] null
3 [ghi] [6] [4/1/2012] null
4 [def] [5] [8/1/2012] null
5 [abc] [4] [10/1/2012] null
6 [def] [5] [12/1/2012] null
7 [ghi] [6] [15/1/2012] null
8 [abc] [4] [17/1/2012] null
9 [ghi] [6] [6/1/2012] null
10 [abc] [1] [13/1/2012] null
Into this:
col1 col2 col3 col4 col5
---- ----- ---- ----------- ----
8 [abc] [4] [17/1/2012] 2
10 [abc] [1] [13/1/2012] 3
6 [def] [5] [12/1/2012] 2
7 [ghi] [6] [15/1/2012] 3
Meaning leave the latest (1) as a representation of every duplicated record.
++ReEditing++
Aaron Bertrand shawnt00 e2nburner... and rest of u all i can't say how much i thank your Reply although i did not yet comprehend that mass of code. i am now going to check those codes but not b4 thanking you guys !!
when i first started to program and needed sql querys, after using
Select * From MyTable
... my 1'st SQL Statement ...
i said HEY i know SQL !!! .... Now ... look at that deep knowledge of you guys ... THANKS A LOT i know that this post in StackOverFlow will be further useful for other beginners too
Upvotes: 0
Views: 1864
Reputation: 17925
This is a simplistic approach. You might find merge
to be better. These versions preserve the highest col1 value and modify the maxdate column. Aaron's preserves the row with the maxdate. That is a distinction that I doubt is important but should be noted.
update MyTable
set col4 = (
select max(col4)
from MyTable as m2
where m2.col2 = MyTable.col2 and m2.col3 = MyTable.col3
), col5 = (
select count(*)
from MyTable as m2
where m2.col2 = MyTable.col2 and m2.col3 = MyTable.col3
)
where not exists (
select *
from MyTable as m2
where
m2.col2 = MyTable.col2 and m2.col3 = MyTable.col3
and m2.col1 > MyTable.col1
and m2.col4 > MyTable.col4 or m2.col4 = MyTable.col4 and m2.col1 > MyTable.col1
);
delete from MyTable
where exists (
select *
from MyTable as m2
where
m2.col2 = MyTable.col2 and m2.col3 = MyTable.col3
and m2.col1 > MyTable.col1
);
EDIT 2 Here's my shot at a merge
query
merge MyTable as target
using (
select max(col1), col2, col3, max(col4), count(*)
from Mytable
group by col2, col3
) as source(id, col2, col3, maxdate, rowcount)
on (
target.col1 = source.col1
and target.col2 = target.col2
and target.col3 = target.col3
)
when matched then
update set col4 = maxdate, col5 = rowcount
when not matched then delete
EDIT 3 Preserve the row with the original maxdate, breaking ties on col1
-- option #1
update MyTable
set col5 = (
select count(*)
from MyTable as m2
where m2.col2 = MyTable.col2 and m2.col3 = MyTable.col3
)
where not exists (
select *
from MyTable as m2
where
m2.col2 = MyTable.col2 and m2.col3 = MyTable.col3
and m2.col4 > MyTable.col4 or m2.col4 = MyTable.col4 and m2.col1 > MyTable.col1
);
delete from MyTable
where exists (
select *
from MyTable as m2
where
m2.col2 = MyTable.col2 and m2.col3 = MyTable.col3
and m2.col4 > MyTable.col4 or m2.col4 = MyTable.col4 and m2.col1 > MyTable.col1
);
-- option #2
merge MyTable as target
using (
select max(col1), col2, col3, max(col4), count(*)
from Mytable
group by col2, col3
) as source(maxid, col2, col3, maxdate, rowcount)
on (
target.col2 = target.col2
and target.col3 = target.col3
and target.col1 = maxid
and target.col4 = maxdate
)
when matched then
update set col5 = rowcount
when not matched then delete
Upvotes: 1
Reputation: 280429
This answer uses a common table expression to apply row_number() and count() to each "slice" of data (meaning grouped by col2 + col3). The count() is used to identify how many rows belong to each such group, and the row_number() is used to apply a "rank" ordered by col4 desc (1 = latest per group, 2 = 2nd latest, etc). This also uses col1 (which looks like a unique column) to break any ties. The CTE can be followed by a query such as a select, update, delete, etc. So you can run the first select to validate that these are the rows you want to keep, and that the counts are correct. If they are, then you can proceed with the updates and deletes. You'll notice that in all cases the row_number() output is used to identify the rows you keep or the rows you discard.
To identify the rows you want to keep:
;WITH n AS
(
SELECT col1, col2, col3, col4,
c = COUNT(*) OVER (PARTITION BY col2, col3),
rn = ROW_NUMBER() OVER
(
PARTITION BY col2, col3 ORDER BY col4 DESC, col1 DESC
)
FROM dbo.table_name
)
SELECT col1, col2, col3, col4, c
FROM n WHERE rn = 1;
Once you've confirmed that those are the row you want to keep, you can update them like this:
;WITH n AS
(
SELECT col1, col2, col3, col4, col5,
c = COUNT(*) OVER (PARTITION BY col2, col3),
rn = ROW_NUMBER() OVER
(
PARTITION BY col2, col3 ORDER BY col4 DESC, col1 DESC
)
FROM dbo.table_name
)
UPDATE n SET col5 = c
WHERE rn = 1;
Then delete the remainders this way:
;WITH n AS
(
SELECT col1, col2, col3, col4,
rn = ROW_NUMBER() OVER
(
PARTITION BY col2, col3 ORDER BY col4 DESC, col1 DESC
)
FROM dbo.table_name
)
DELETE n WHERE rn > 1;
Or even more simply (assuming col5 was completely null before the update):
DELETE dbo.table_name WHERE col5 IS NULL;
Upvotes: 2
Reputation: 701
WITH a AS (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY colum2 ORDER BY colum3 desc) RowNum
FROM mytable
)
-- deleted rows will be:
delete from mytable
where [yourID] in
(SELECT [yourID]
FROM a
WHERE a.RowNum <> 1 )
Upvotes: 0