Reputation: 39874

How do I extract duplicate records from a table using multiple fields?

I've already checked out the question Deleting duplicate records using a temporary table and it doesn't quite go far enough to assist me with this question:

I have a table of approximately 200,000 address locations hosted on a SQL 2000 Server. This table has a huge problem with duplicate data in the table caused by invalid input from various parties over the years. I need to output a list of duplicate records so I can begin the long process of cleaning them up.

So consider the following table structure:

Table Company(   
  CompanyId NVarChar(10) Not Null Constraint PK_Locations Primary Key,   
  CompanyName NVarChar(30),   
  CompanyAddress NVarChar(30),   
  CompanyCity NVarchar(30), 
  CompanyState Char(2),   
  CompanyZip NVarChar(10),   
  DateCreated DateTime, 
  LastModified DateTime,  
  LastModifiedUser NVarChar(64) 
)

For the first parse I'm not even going to worry about typos and variations of spelling yet which is going to be a greater nightmare down the road that I haven't even got the first clue about solving yet.

So for this part a record is considered to be duplicate when multiple records match on the following conditions:

(CompanyName Or CompanyAddress) And CompanyCity And CompanyState

Zip is excluded because so many of the locations are missing zip/postal codes and so many are entered incorrectly that it just makes for a far less accurate report if I include them.

I realize that there may legitimately be multiple locations for a company within a single city/state [for instance McDonalds, just off the top of my head], and there may legitimately be multple companies at a single address within a city and state [for instance inside a shopping mall or office building], but for now we will consider that these at least warrant some level of human attention and will include them in the report.

Matches on single fields are a piece of cake, but I'm coming unstuck when I get to multiple fields, especially when some are conditional.

Upvotes: 0

Answers (3)

Quassnoi

Reputation: 425863

WITH q AS (
          SELECT Company.*,
                 ROW_NUMBER() OVER (PARTITION BY CompanyState, CompanyCity, CompanyName ORDER BY CompanyID) AS rnName,
                 ROW_NUMBER() OVER (PARTITION BY CompanyState, CompanyCity, CompanyAddress ORDER BY CompanyID) AS rnAddress
          FROM Company
          )
SELECT    *
WHERE     rnName > 1 OR rnAddress > 1

Note, though, that if your data will look like this:

CompanyID    CompanyName  CompanyAddress 
---------    -----------  --------------
1            McDonalds    Avenue 1
2            McDonalds    Avenue 2
3            Starbucks    Avenue 2

, then both records 2 and 3 will be deleted (which is what you requested but probably not what you wanted)

If you just want to list all rows having duplicates, then issue:

SELECT  *
FROM    Company co
WHERE   EXISTS
        (
        SELECT  1
        FROM    Company cn
        WHERE   cn.CompanyState = co.CompanyState
                AND cn.CompanyCity = co.CompanyCity
                AND cn.CompanyName = co.CompanyName
                AND cn.CompanyID <> co.CompanyID
        )
        OR EXISTS
        (
        SELECT  1
        FROM    Company ca
        WHERE   ca.CompanyState = co.CompanyState
                AND ca.CompanyCity = co.CompanyCity
                AND ca.CompanyAddress = co.CompanyAddress
                AND ca.CompanyID <> co.CompanyID
        )

This will work in SQL Server 2000 too.

Having indexes on (CompanyState, CompanyCity, CompanyName) and (CompanyState, CompanyCity, CompanyAddress) will greatly improve this query.

Upvotes: 2

Eppz

Reputation: 3226

Try something like this...

Select field1, field2, ... etc, count(*)
FROM Company,
GROUP BY field1, field2, ...
HAVING count(*) > 1

This should show you what is duplicated.

Upvotes: 0

Tom H

Reputation: 47402

SELECT
     C1.CompanyID,
     C2.CompanyID
FROM
     Company C1
INNER JOIN Company C2 ON
     (C2.CompanyName = C1.CompanyName OR C2.CompanyAddress = C1.CompanyAddress) AND
     C2.CompanyCity = C1.CompanyCity AND
     C2.CompanyState = C2.CompanyState AND
     C2.CompanyID > C1.CompanyID

If you have three or more matches then they will appear multiple times in the list. There are various ways to handle that depending on what exactly you want to get back from the query.

I would also strongly suggest that you look into better front-end coding to restrict how addresses are getting into your system as well as user training.

Upvotes: 0

How do I extract duplicate records from a table using multiple fields?

Answers (3)

Related Questions