Jordash
Jordash

Reputation: 3093

Merge Duplicate Rows in MySQL

I have a database like this:

users
id    name    email                phone
1     bill    [email protected]
2     bill    [email protected]   123456789
3     susan   [email protected]
4     john    [email protected]   123456789
5     john    [email protected]   987654321

I want to merge records considered duplicates based on the email field.

Trying to figure out how to use the following considerations.

  1. Merge based on duplicate email
  2. If one row has a null value use the row that has the most data.
  3. If 2 rows are duplicates but other fields are different then use the one

with the highest id number (see the [email protected] row for an example.)

Here is a query I tried:

DELETE FROM users WHERE users.id NOT IN 
(SELECT grouped.id FROM (SELECT DISTINCT ON (email) * FROM users) AS grouped)

Getting a syntax error.

I'm trying to get the database to transform to this, I can't figure out the correct query:

users
id   name    email                 phone
2    bill    [email protected]    123456789
3    susan   [email protected]   
5    john    [email protected]    987654321

Upvotes: 4

Views: 1748

Answers (2)

Tim Biegeleisen
Tim Biegeleisen

Reputation: 521289

Here is one option using a delete join:

DELETE
FROM users
WHERE id NOT IN (SELECT id
                 FROM (
                     SELECT CASE WHEN COUNT(*) = 1
                                 THEN MAX(id)
                                 ELSE MAX(CASE WHEN phone IS NOT NULL THEN id END) END AS id
                     FROM users
                     GROUP BY email) t);

The logic of this delete is as follows:

  • Emails where there is only one record are not deleted
  • For emails with two or more records, we delete everything except for the record having the highest id value, where the phone is also defined.

Upvotes: 3

Nick
Nick

Reputation: 147166

Here's a solution that will give you the latest data for each field for each user in the result table, thus meeting your second criterion as well as the first and third. It will work for as many duplicates as you have, subject to the group_concat_max_len condition on GROUP_CONCAT. It uses GROUP_CONCAT to prepare a list of all values of a field for each user, sorted so that the most recent value is first. SUBSTRING_INDEX is then used to extract the first value in that list, which is the most recent. This solution uses a CREATE TABLE ... SELECT command to make a new users table, then DROPs the old one and renames the new table to users.

CREATE TABLE users
    (`id` int, `name` varchar(5), `email` varchar(19), `phone` int)
;

INSERT INTO users
    (`id`, `name`, `email`, `phone`)
VALUES
    (1, 'bill', '[email protected]', 123456789),
    (2, 'bill', '[email protected]', NULL),
    (3, 'susan', '[email protected]', NULL),
    (4, 'john', '[email protected]', 123456789),
    (5, 'john', '[email protected]', 987654321)
;

CREATE TABLE newusers AS
SELECT id
     , SUBSTRING_INDEX(names, ',', 1) AS name
     , email
     , SUBSTRING_INDEX(phones, ',', 1) AS phone
FROM (SELECT id
           , GROUP_CONCAT(name ORDER BY id DESC) AS names
           , email
           , GROUP_CONCAT(phone ORDER BY id DESC) AS phones
      FROM users
      GROUP BY email) u;

DROP TABLE users;

RENAME TABLE newusers TO users;

SELECT * FROM users

Output:

id  name    email                   phone
1   bill    [email protected]      123456789
4   john    [email protected]      987654321
3   susan   [email protected]     (null)

Demo on SQLFiddle

Upvotes: 1

Related Questions