Reputation: 39874
I've already checked out the question Deleting duplicate records using a temporary table and it doesn't quite go far enough to assist me with this question:
I have a table of approximately 200,000 address locations hosted on a SQL 2000 Server. This table has a huge problem with duplicate data in the table caused by invalid input from various parties over the years. I need to output a list of duplicate records so I can begin the long process of cleaning them up.
So consider the following table structure:
Table Company(
CompanyId NVarChar(10) Not Null Constraint PK_Locations Primary Key,
CompanyName NVarChar(30),
CompanyAddress NVarChar(30),
CompanyCity NVarchar(30),
CompanyState Char(2),
CompanyZip NVarChar(10),
DateCreated DateTime,
LastModified DateTime,
LastModifiedUser NVarChar(64)
)
For the first parse I'm not even going to worry about typos and variations of spelling yet which is going to be a greater nightmare down the road that I haven't even got the first clue about solving yet.
So for this part a record is considered to be duplicate when multiple records match on the following conditions:
(CompanyName Or CompanyAddress) And CompanyCity And CompanyState
Zip is excluded because so many of the locations are missing zip/postal codes and so many are entered incorrectly that it just makes for a far less accurate report if I include them.
I realize that there may legitimately be multiple locations for a company within a single city/state [for instance McDonalds, just off the top of my head], and there may legitimately be multple companies at a single address within a city and state [for instance inside a shopping mall or office building], but for now we will consider that these at least warrant some level of human attention and will include them in the report.
Matches on single fields are a piece of cake, but I'm coming unstuck when I get to multiple fields, especially when some are conditional.
Upvotes: 0
Views: 5442
Reputation: 425863
WITH q AS (
SELECT Company.*,
ROW_NUMBER() OVER (PARTITION BY CompanyState, CompanyCity, CompanyName ORDER BY CompanyID) AS rnName,
ROW_NUMBER() OVER (PARTITION BY CompanyState, CompanyCity, CompanyAddress ORDER BY CompanyID) AS rnAddress
FROM Company
)
SELECT *
WHERE rnName > 1 OR rnAddress > 1
Note, though, that if your data will look like this:
CompanyID CompanyName CompanyAddress
--------- ----------- --------------
1 McDonalds Avenue 1
2 McDonalds Avenue 2
3 Starbucks Avenue 2
, then both records 2
and 3
will be deleted (which is what you requested but probably not what you wanted)
If you just want to list all rows having duplicates, then issue:
SELECT *
FROM Company co
WHERE EXISTS
(
SELECT 1
FROM Company cn
WHERE cn.CompanyState = co.CompanyState
AND cn.CompanyCity = co.CompanyCity
AND cn.CompanyName = co.CompanyName
AND cn.CompanyID <> co.CompanyID
)
OR EXISTS
(
SELECT 1
FROM Company ca
WHERE ca.CompanyState = co.CompanyState
AND ca.CompanyCity = co.CompanyCity
AND ca.CompanyAddress = co.CompanyAddress
AND ca.CompanyID <> co.CompanyID
)
This will work in SQL Server 2000
too.
Having indexes on (CompanyState, CompanyCity, CompanyName)
and (CompanyState, CompanyCity, CompanyAddress)
will greatly improve this query.
Upvotes: 2
Reputation: 3226
Try something like this...
Select field1, field2, ... etc, count(*)
FROM Company,
GROUP BY field1, field2, ...
HAVING count(*) > 1
This should show you what is duplicated.
Upvotes: 0
Reputation: 47402
SELECT
C1.CompanyID,
C2.CompanyID
FROM
Company C1
INNER JOIN Company C2 ON
(C2.CompanyName = C1.CompanyName OR C2.CompanyAddress = C1.CompanyAddress) AND
C2.CompanyCity = C1.CompanyCity AND
C2.CompanyState = C2.CompanyState AND
C2.CompanyID > C1.CompanyID
If you have three or more matches then they will appear multiple times in the list. There are various ways to handle that depending on what exactly you want to get back from the query.
I would also strongly suggest that you look into better front-end coding to restrict how addresses are getting into your system as well as user training.
Upvotes: 0