Richard Pianka
Richard Pianka

Reputation: 3356

SQL query to find duplicate rows, in any table

I'm looking for a schema-independent query. That is, if I have a users table or a purchases table, the query should be equally capable of catching duplicate rows in either table without any modification (other than the from clause, of course).

I'm using T-SQL, but I'm guessing there should be a general solution.

Upvotes: 5

Views: 22750

Answers (4)

ppijnenburg
ppijnenburg

Reputation: 153

I recently was looking into the same issue and noticed this question. I managed to solve it using a stored procedure with some dynamic SQL. This way you only need to specify the table name. And it will get all the other relevant data from sys tables.

/*
This SP returns all duplicate rows (1 line for each duplicate) for any given table.

to use the SP:
exec [database].[dbo].[sp_duplicates] 
    @table = '[database].[schema].[table]'  

*/
create proc dbo.sp_duplicates @table nvarchar(50) as

declare @query nvarchar(max)
declare @groupby nvarchar(max)

set @groupby =  stuff((select ',' + [name]
                FROM sys.columns
                WHERE object_id = OBJECT_ID(@table)
                FOR xml path('')), 1, 1, '')

set @query = 'select *, count(*)
                from '+@table+'
                group by '+@groupby+'
                having count(*) > 1'

exec (@query)

Upvotes: 0

Tom H
Tom H

Reputation: 47444

I believe that this should work for you. Keep in mind that CHECKSUM() isn't 100% perfect - it's theoretically possible to get a false positive here (I think), but otherwise you can just change the table name and this should work:

;WITH cte AS (
    SELECT
        *,
        CHECKSUM(*) AS chksum,
        ROW_NUMBER() OVER(ORDER BY GETDATE()) AS row_num
    FROM
        My_Table
)
SELECT
    *
FROM
    CTE T1
INNER JOIN CTE T2 ON
    T2.chksum = T1.chksum AND
    T2.row_num <> T1.row_num

The ROW_NUMBER() is needed so that you have some way of distinguishing rows. It requires an ORDER BY and that can't be a constant, so GETDATE() was my workaround for that.

Simply change the table name in the CTE and it should work without spelling out the columns.

Upvotes: 8

JohnD
JohnD

Reputation: 14747

I have done this using CTEs in SQL Server.

Here is a sample on how to delete dupes but you should be able to adapt it easily to find dupes:

WITH CTE (COl1, Col2, DuplicateCount)
AS
(
    SELECT COl1,Col2,
    ROW_NUMBER() OVER(PARTITION BY COl1,Col2 ORDER BY Col1) AS DuplicateCount
    FROM DuplicateRcordTable
)
DELETE
FROM CTE
WHERE DuplicateCount > 1
GO

Here is a link to an article where I got the SQL:

http://blog.sqlauthority.com/2009/06/23/sql-server-2005-2008-delete-duplicate-rows/

Upvotes: 1

Conrad Frix
Conrad Frix

Reputation: 52645

I'm still confused about what "detecting them might be" but I'll give it a shot.

Excluding them is easy

e.g.

SELECT DISTINCT * FROM USERS

However if you wanted to only include them and a duplicate is all the fields than you have to do

SELECT 
   [Each and every field]
FROM
   USERS
GROUP BY
   [Each and every field]
HAVING COUNT(*) > 1  

You can't get away with just using (*) because you can't GROUP BY * so this requirement from your comments is difficult

a schema-independent means I don't want to specify all of the columns in the query

Unless that is you want to use dynamic SQL and read the columns from sys.columns or information_schema.columns

For example

DECLARE @colunns nvarchar(max)
SET  @colunns = ''

SELECT @colunns = @colunns  + '[' +  COLUMN_NAME  +'], ' 
FROM INFORMATION_SCHEMA.columns  
WHERE table_name = 'USERS'

SET  @colunns  = left(@colunns,len(@colunns ) - 1)


DECLARE @SQL nvarchar(max)
SET @SQL = 'SELECT '  + @colunns 
          + 'FROM  USERS' + 'GROUP BY ' 
          + @colunns 
           + ' Having Count(*) > 1'


exec sp_executesql @SQL

Please note you should read this The Curse and Blessings of Dynamic SQL if you haven't already

Upvotes: 2

Related Questions