Reputation: 163

Find duplicate records in SQL Server, but also return the set of unique keys of each one

I have a SQL Server database with a unique key column and 49 columns of data elements (name/address/etc......). I have "duplicate" entries but with different keys and I want to find those duplicates entries.

As an example, I may have "John Smith" (with 47 other columns of information) in the table twice. Both John Smith entries will have a different unique key column, but other than that, all other columns would be identical. Including if one off the columns is NULL, then it will be NULL for both John Smith entries.

To complicate things, there are two tables which I need to join together, then once joined find any entries where data elements (everything except for the key) is the same.

Table1 layout

MyKey, table2ID, Col1, Col2, Col3....Col46.

Table2 layout

ID, col47, col48, col49

Col1 through to Col49 is where the "duplicate" data could be.

I have tried something like the below, which almost works. It fails if I have NULL values. For example, if Col22 is NULL on both John Smith entries (ie they are both the same NULL value) then they are not picked up in the selection.

Question: how do I get something like the below to work even when there are NULL values that need to be compared against each other.

with MyJoinedTable as
(
    select PolicyNumber, col01, col02, col03......col49
    from table1
    inner join table2 on table2id = table2.id
)
select PolicyNumber, t1.col01, t1.col02, t1.col03.......t1.col49
from MyJoinedTable t1
inner join (select col01, col02, col03......col49
            from MyJoinedTable
            group by col01, col02, col03......col49
            having count(*) > 1) t2 
      on t1.col01 = t2.col01
      and t1.col02 = t2.col02
      .......
      and t1.col49 = t2.col49
order by t1.col01, t1.col02

Upvotes: 0

Answers (2)

Gordon Linoff

Reputation: 1269693

One method is:

select t.*
from t
where exists (select 1
              from t t2
              where t2.col1 = t.col1 and
                    t2.col2 = t.col2 and
                    . . .
                    t2.policyNumber <> t.policyNumber
             );

This works assuming that none of the other columns are NULL.

EDIT:

If you are using SQL Server, I would just do:

select t.*
from (select t.*,
             min(id) over (partition by col1, col2, . . . ) as min_id,
             max(id) over (partition by col1, col2, . . . ) as max_id
      from t
     ) t
where minid <> maxid;

Upvotes: 0

sticky bit

Reputation: 37472

Group in a subquery with HAVING count(*) > 1 and join it back in.

SELECT to1.policynumber,
       to1.col1,
       ...
       to1.col49
       FROM elbat to1
            INNER JOIN (SELECT ti.col1,
                               ...
                               ti.col49
                               FROM elbat ti
                               GROUP BY col1,
                                        ...
                                        col49
                               HAVING count(*) > 1) to2
                       ON to2.col1 = to1.col1
                          ...
                          AND to2.col49 = to1.col49;

Or use an EXISTS.

SELECT to.policynumber,
       to.col1,
       ...
       to.col49
       FROM elbat to
       WHERE EXISTS (SELECT *
                            FROM elbat ti
                            WHERE ti.policynumber <> to.policynumber
                                  AND ti.col1 = to.col1
                                  ...
                                  AND ti.col49 = to.col49);

Upvotes: 0

Find duplicate records in SQL Server, but also return the set of unique keys of each one

Answers (2)

Related Questions