Reputation: 139

Delete duplicate rows in different columns

Table1:

id   name  address1   address2  address3
------------------------------------------
1    Jenny    A         B        NULL
2    John     C        NULL      NULL
3    Jenny    B         A        NULL
4    John    NULL      NULL       C

.....

id1 and id3 are the same in this condition, id2 and id4 are the same too.

Can I delete id3 and id4? I'm using SQL Server 2019.

Upvotes: 0

Answers (3)

Gordon Linoff

Reputation: 1271151

I don't think fancy manipulation of the three columns is needed. Assuming you do not have duplicate values in the columns:

delete t1 from table1 t1
    where exists (select 1
                  from table1 tt1
                  where tt1.id < t1.id and
                        (t1.address1 in (tt1.address1, tt1.address2, tt1.address3) or t1.address1 is null
                        ) and
                        (t1.address2 in (tt1.address1, tt1.address2, tt1.address3) or t1.address2 is null
                        ) and
                        (t1.address3 in (tt1.address1, tt1.address2, tt1.address3) or t1.address3 is null
                        )  and
                        ( (case when t1.address1 is null then 1 else 0 end +
                           case when t1.address2 is null then 1 else 0 end +
                           case when t1.address3 is null then 1 else 0 end
                          ) =
                          (case when tt1.address1 is null then 1 else 0 end +
                           case when tt1.address2 is null then 1 else 0 end +
                           case when tt1.address3 is null then 1 else 0 end
                          )
                        )
             );

This checks that each non-NULL value matches and the number of NULL values is the same. This implies that the two sets of values are equivalent.

The advantage is that you don't have to worry about separators -- which is a concern for either string_agg() or XML.

Upvotes: 0

Zhorov

Reputation: 30023

You need to define a condition for duplicate rows. One possible approach to define such a condition is to aggregate ordered addresses. The following statement demonstrates this approach:

Table:

CREATE TABLE Data (
    id int,
    name varchar(10),  
    address1 varchar(100),   
    address2 varchar(100),  
    address3 varchar(100)
)
INSERT INTO Data
    (id, name, address1, address2, address3)
VALUES
    (1, 'Jenny', 'A',  'B',  NULL),
    (2, 'John',  'C',  NULL, NULL),
    (3, 'Jenny', 'B',  'A',  NULL),
    (4, 'John ', NULL, NULL, 'C')

Statement:

DELETE x
FROM (
    SELECT 
        d.*,
        -- Condition for equal addresses
        ROW_NUMBER() OVER (PARTITION BY c.CheckCondtition ORDER BY d.id) AS rn
        -- Condition for equal name and addresses
        -- ROW_NUMBER() OVER (PARTITION BY d.name, c.CheckCondtition ORDER BY d.id) AS rn
    FROM Data d
    CROSS APPLY (
        SELECT CONCAT(',', [address]) 
        FROM (VALUES (d.address1), (d.address2), (d.address3)) v([address]) 
        ORDER BY [address]
        FOR XML PATH('')
    ) c(CheckCondtition)
) x
WHERE x.rn > 1

Upvotes: 1

Nick

Reputation: 147266

If you're using SQL Server 2017 or later, you can build up a unique identifier for each row based on the name and each of the address fields using STRING_AGG; then find row numbers for each occurrence of that value (ordering by id), and then you can delete the rows with row number > 1:

WITH CTE AS (
  SELECT id, name AS value
  FROM data
  UNION ALL
  SELECT id, address1
  FROM data
  UNION ALL
  SELECT id, address2
  FROM data
  UNION ALL
  SELECT id, address3
  FROM data
),
CTE2 AS (
  SELECT id, STRING_AGG(value, '%') WITHIN GROUP (ORDER BY value) AS v
  FROM CTE
  GROUP BY id
),
CTE3 AS (
  SELECT id, v,
         ROW_NUMBER() OVER (PARTITION BY v ORDER BY id) AS rn
  FROM CTE2
)
DELETE d
FROM data d
JOIN CTE3 ON CTE3.id = d.id
WHERE CTE3.rn > 1

Demo on SQLFiddle

Upvotes: 1

Delete duplicate rows in different columns

Answers (3)

Related Questions