gavsmith
gavsmith

Reputation: 455

SQL remove duplicates - keeping a record included in the highest count

I have a table with duplicate data similar to below example:

ID | ACCNO  | ACCNAME    | ADDRESS1     | ADDRESS2    | City
1  | 1001   | Joe B Ltd  | 123 Street1  |             | London
2  | 1001   | JoeB Ltd   | 123 Street1  |             | London
3  | 1001   | JoeB Ltd   | 123 Street1  |             | London
4  | 1001   | JoeB Ltd   | 123 Street1  | London      | London
5  | 1001   | JoeB Ltd   | 129 Street9  |             | London

ID is currently the unique primary key, however ACCNO should be when duplicates removed.

I've seen many queries to remove duplicate records such as https://stackoverflow.com/a/18719814/4949859

However I would like to choose which row to keep based on the count of duplicate rows. I believe that if I select a row from the grouped items with the highest count I'm most likely to get a correctly formatted address.

In my example using "NOT IN (SELECT MAX" or "MIN" will leave the wrong record in my case.

However when I use GROUP BY to get the highest count I can't include the ID field.

SELECT COUNT(ID), ACCNO, ACCNAME, ADDRESS1, ADDRESS2, CITY FROM SUPPLIERS GROUP BY ACCNO, ACCNAME, ADDRESS1, ADDRESS2, CITY ORDER BY COUNT(ID) DESC

This would give the result:

Count(ID) | ACCNO  | ACCNAME    | ADDRESS1     | ADDRESS2    | City
2         | 1001   | JoeB Ltd   | 123 Street1  |             | London
1         | 1001   | Joe B Ltd  | 123 Street1  |             | London
1         | 1001   | JoeB Ltd   | 123 Street1  | London      | London
1         | 1001   | JoeB Ltd   | 129 Street9  |             | London

Hope I'm making sense. I don't know how to return an ID (any) from a group where the count is highest. Does anybody else know how I might achieve this?

Edit:

I the above example grouping all columns except ID and getting a count, rows 2 and 3 would be grouped together giving a group count of 2 (the rest would be count ID of 1 as they are all unique) so I would want to keep row 2 or 3, doesn't matter which of those as they are both the same.

Edit 2:

I thought this was going to work:

DELETE
FROM SUPPLIERS
WHERE ID NOT IN 
 (SELECT TOP 1 MAX(ID) FROM SUPPLIERS  
  Group By ACCNO, ACCNAME, ADDRESS1, ADDRESS2, CITY 
  ORDER BY COUNT(ID) DESC)

Unfortunately this deletes all but one record, the select version of it looked promising:

SELECT *
FROM SUPPLIERS a
WHERE ID NOT IN 
 (SELECT TOP 1 MAX(ID) FROM SUPPLIERS b 
  WHERE a.ACCNO = b.ACCNO Group By ACCNO, ACCNAME, ADDRESS1, ADDRESS2, CITY 
  ORDER BY COUNT(ID) DESC)

Answer:

With thanks to user1751825 (marked as answer as got me closest to final result)

DELETE FROM SUPPLIERS WHERE ID IN (SELECT ID
FROM SUPPLIERS a
 WHERE ID NOT IN 
  (SELECT TOP 1 MAX(ID) FROM SUPPLIERS b 
  WHERE a.ACCNO = b.ACCNO Group By ACCNO, ACCNAME, ADDRESS1, ADDRESS2, CITY 
  ORDER BY COUNT(ID) DESC))

Upvotes: 0

Views: 110

Answers (3)

user1751825
user1751825

Reputation: 4309

I think this should do what you need.

delete from SUPPLIERS
where ID NOT IN (
    Select max(ID)
    FROM SUPPLIERS
    Group by ACCNO
)

Upvotes: 0

prilla75
prilla75

Reputation: 1

this worked for me.

Table

ID   ACCNO       ACCNAME         ADDRESS1        ADDRESS2        City
1    1001        Joe B Ltd       123 Street1                     London
2    1001        JoeB Ltd        123 Street1                     London
3    1001        JoeB Ltd        123 Street1                     London
4    1001        JoeB Ltd        123 Street1     London          London
5    1001        JoeB Ltd        129 Street9                     London
6    67          Nise            Gata1      
7    67          Nisse           Gata2      
8    67          Nisse           Gata1           Haninge         Stockholm

RESULT:

ACCNO    ACCNAME         ADDRESS1        ADDRESS2        City
1001     JoeB Ltd        123 Street1     London          London
67       Nisse           Gata1           Haninge         Stockholm

Code:

select distinct 
    [ ACCNO  ],
    FIRST_VALUE([ ACCNAME    ]) OVER (PARTITION BY [ ACCNO  ] ORDER BY case when [ ACCNAME    ] is null then 1 else 0 end, rownumber  ) as [ ACCNAME    ],
    FIRST_VALUE([ ADDRESS1     ]) OVER (PARTITION BY [ ACCNO  ] ORDER BY case when [ ADDRESS1     ] is null then 1 else 0 end, rownumber  ) as [ ADDRESS1     ],
    FIRST_VALUE([ ADDRESS2    ]) OVER (PARTITION BY [ ACCNO  ] ORDER BY case when [ ADDRESS2    ] is null then 1 else 0 end, rownumber  ) as [ ADDRESS2    ],
    FIRST_VALUE([ City]) OVER (PARTITION BY [ ACCNO  ] ORDER BY case when [ City] is null then 1 else 0 end, rownumber  ) as [ City]
FROM 
(
    SELECT 
          [ ACCNO  ]
          ,[ ACCNAME    ]
          ,[ ADDRESS1     ]
          ,case when ltrim(rtrim([ ADDRESS2    ] )) = '' then null else [ ADDRESS2    ] end as [ ADDRESS2    ] -- spaces = NULL
          ,case when ltrim(rtrim([ City] )) = '' then null else [ City] end as [ City]

          ,count(*) as quantity
          ,ROW_NUMBER() OVER (
                                PARTITION BY [ ACCNO  ] 
                                ORDER BY 
                                    [ ACCNO  ], 
                                    count(*) desc
                             ) as rownumber
      FROM [dbo].[test_sql]  
      GROUP BY cube ( [ ACCNO  ]
          ,[ ACCNAME    ]
          ,[ ADDRESS1     ]
          ,[ ADDRESS2    ]
          ,[ City])
        HAVING [ ACCNO  ] is not null 
 ) myGroup

Upvotes: 0

freakyhat
freakyhat

Reputation: 471

As per my understanding, in the example you provided, you want to keep the record with ID = 5 and delete the rest.

WITH CTE AS(
   SELECT ID, ACCNO, ACCNAME, ADDRESS1, ADDRESS2, CITY,
       RN = ROW_NUMBER()OVER(PARTITION BY ACCNO ORDER BY ID DESC)
  FROM SUPPLIERS
)
DELETE FROM CTE WHERE RN > 1

That should do the trick!

Upvotes: 1

Related Questions