Reputation: 455
I have a table with duplicate data similar to below example:
ID | ACCNO | ACCNAME | ADDRESS1 | ADDRESS2 | City
1 | 1001 | Joe B Ltd | 123 Street1 | | London
2 | 1001 | JoeB Ltd | 123 Street1 | | London
3 | 1001 | JoeB Ltd | 123 Street1 | | London
4 | 1001 | JoeB Ltd | 123 Street1 | London | London
5 | 1001 | JoeB Ltd | 129 Street9 | | London
ID is currently the unique primary key, however ACCNO should be when duplicates removed.
I've seen many queries to remove duplicate records such as https://stackoverflow.com/a/18719814/4949859
However I would like to choose which row to keep based on the count of duplicate rows. I believe that if I select a row from the grouped items with the highest count I'm most likely to get a correctly formatted address.
In my example using "NOT IN (SELECT MAX" or "MIN" will leave the wrong record in my case.
However when I use GROUP BY to get the highest count I can't include the ID field.
SELECT COUNT(ID), ACCNO, ACCNAME, ADDRESS1, ADDRESS2, CITY FROM SUPPLIERS GROUP BY ACCNO, ACCNAME, ADDRESS1, ADDRESS2, CITY ORDER BY COUNT(ID) DESC
This would give the result:
Count(ID) | ACCNO | ACCNAME | ADDRESS1 | ADDRESS2 | City
2 | 1001 | JoeB Ltd | 123 Street1 | | London
1 | 1001 | Joe B Ltd | 123 Street1 | | London
1 | 1001 | JoeB Ltd | 123 Street1 | London | London
1 | 1001 | JoeB Ltd | 129 Street9 | | London
Hope I'm making sense. I don't know how to return an ID (any) from a group where the count is highest. Does anybody else know how I might achieve this?
Edit:
I the above example grouping all columns except ID and getting a count, rows 2 and 3 would be grouped together giving a group count of 2 (the rest would be count ID of 1 as they are all unique) so I would want to keep row 2 or 3, doesn't matter which of those as they are both the same.
Edit 2:
I thought this was going to work:
DELETE
FROM SUPPLIERS
WHERE ID NOT IN
(SELECT TOP 1 MAX(ID) FROM SUPPLIERS
Group By ACCNO, ACCNAME, ADDRESS1, ADDRESS2, CITY
ORDER BY COUNT(ID) DESC)
Unfortunately this deletes all but one record, the select version of it looked promising:
SELECT *
FROM SUPPLIERS a
WHERE ID NOT IN
(SELECT TOP 1 MAX(ID) FROM SUPPLIERS b
WHERE a.ACCNO = b.ACCNO Group By ACCNO, ACCNAME, ADDRESS1, ADDRESS2, CITY
ORDER BY COUNT(ID) DESC)
Answer:
With thanks to user1751825 (marked as answer as got me closest to final result)
DELETE FROM SUPPLIERS WHERE ID IN (SELECT ID
FROM SUPPLIERS a
WHERE ID NOT IN
(SELECT TOP 1 MAX(ID) FROM SUPPLIERS b
WHERE a.ACCNO = b.ACCNO Group By ACCNO, ACCNAME, ADDRESS1, ADDRESS2, CITY
ORDER BY COUNT(ID) DESC))
Upvotes: 0
Views: 110
Reputation: 4309
I think this should do what you need.
delete from SUPPLIERS
where ID NOT IN (
Select max(ID)
FROM SUPPLIERS
Group by ACCNO
)
Upvotes: 0
Reputation: 1
this worked for me.
Table
ID ACCNO ACCNAME ADDRESS1 ADDRESS2 City
1 1001 Joe B Ltd 123 Street1 London
2 1001 JoeB Ltd 123 Street1 London
3 1001 JoeB Ltd 123 Street1 London
4 1001 JoeB Ltd 123 Street1 London London
5 1001 JoeB Ltd 129 Street9 London
6 67 Nise Gata1
7 67 Nisse Gata2
8 67 Nisse Gata1 Haninge Stockholm
RESULT:
ACCNO ACCNAME ADDRESS1 ADDRESS2 City
1001 JoeB Ltd 123 Street1 London London
67 Nisse Gata1 Haninge Stockholm
Code:
select distinct
[ ACCNO ],
FIRST_VALUE([ ACCNAME ]) OVER (PARTITION BY [ ACCNO ] ORDER BY case when [ ACCNAME ] is null then 1 else 0 end, rownumber ) as [ ACCNAME ],
FIRST_VALUE([ ADDRESS1 ]) OVER (PARTITION BY [ ACCNO ] ORDER BY case when [ ADDRESS1 ] is null then 1 else 0 end, rownumber ) as [ ADDRESS1 ],
FIRST_VALUE([ ADDRESS2 ]) OVER (PARTITION BY [ ACCNO ] ORDER BY case when [ ADDRESS2 ] is null then 1 else 0 end, rownumber ) as [ ADDRESS2 ],
FIRST_VALUE([ City]) OVER (PARTITION BY [ ACCNO ] ORDER BY case when [ City] is null then 1 else 0 end, rownumber ) as [ City]
FROM
(
SELECT
[ ACCNO ]
,[ ACCNAME ]
,[ ADDRESS1 ]
,case when ltrim(rtrim([ ADDRESS2 ] )) = '' then null else [ ADDRESS2 ] end as [ ADDRESS2 ] -- spaces = NULL
,case when ltrim(rtrim([ City] )) = '' then null else [ City] end as [ City]
,count(*) as quantity
,ROW_NUMBER() OVER (
PARTITION BY [ ACCNO ]
ORDER BY
[ ACCNO ],
count(*) desc
) as rownumber
FROM [dbo].[test_sql]
GROUP BY cube ( [ ACCNO ]
,[ ACCNAME ]
,[ ADDRESS1 ]
,[ ADDRESS2 ]
,[ City])
HAVING [ ACCNO ] is not null
) myGroup
Upvotes: 0
Reputation: 471
As per my understanding, in the example you provided, you want to keep the record with ID
= 5 and delete the rest.
WITH CTE AS(
SELECT ID, ACCNO, ACCNAME, ADDRESS1, ADDRESS2, CITY,
RN = ROW_NUMBER()OVER(PARTITION BY ACCNO ORDER BY ID DESC)
FROM SUPPLIERS
)
DELETE FROM CTE WHERE RN > 1
That should do the trick!
Upvotes: 1