Bora Sumer
Bora Sumer

Reputation: 860

Update duplicate email addresses on mysql database table

I have a huge database that I have almost over 10k row in my user table and there are 2700 duplicate email addresses.

Basically the application did not limit the users from registering their accounts with the same email address over and over again. I have cleaned the multiple ones -more than 2 times- manually, there weren't many, but there are 2700 email addresses with duplicate value occur at least 2 times. So I want to update the duplicate email addresses and change the email address with a smaller id number to something like from "[email protected]" to "[email protected]", basically adding "1" to the beginning of all duplicate email addresses. I can select and display the duplicate email addresses but could not find the way to update only one of the email addresses and leave the other on untouched.

My table structure is like id username email password.

Upvotes: 0

Views: 1224

Answers (2)

Akina
Akina

Reputation: 42728

WITH cte AS ( SELECT id, 
                     email, 
                     ROW_NUMBER() OVER (PARTITION BY email ORDER BY id) rn
              FROM sourcetable )
UPDATE sourcetable src, cte
SET src.email = CONCAT(rn - 1, src.email)
WHERE src.id = cte.id
  AND cte.rn > 1;

fiddle


I want to update the duplicate email addresses and change the email address with a smaller id number

If so the ordering in window function must be reversed:

WITH cte AS ( SELECT id, 
                     email, 
                     ROW_NUMBER() OVER (PARTITION BY email ORDER BY id DESC) rn
              FROM sourcetable )
UPDATE sourcetable src, cte
SET src.email = CONCAT(rn - 1, src.email)
WHERE src.id = cte.id
  AND cte.rn > 1;

fiddle

Upvotes: 2

Booboo
Booboo

Reputation: 44223

If you do not have MySQL 8:

Here I am just prepending the id of the row to the email address:

UPDATE my_table JOIN (
    SELECT email, MAX(id) AS max_id, COUNT(*) AS cnt FROM my_table
    GROUP BY email
    HAVING cnt > 1
) sq ON my_table.email = sq.email AND my_table.id <> sq.max_id
SET my_table.email = CONCAT( my_table.id, my_table.email)
;

See DB-Fiddle

The inner query:

SELECT email, MAX(id) AS max_id, COUNT(*) AS cnt FROM my_table
GROUP BY email
HAVING cnt > 1

looks for all emails that that are duplicated (i.e. there is more than one row with the same email address) and computes the row that has the maximum id value for each email address. For the sample data in my DB-Fiddle demo, it would return the following:

| email            | max_id | cnt |
| ---------------- | ------ | --- |
| [email protected] | 3      | 3   |
| [email protected] | 5      | 2   |

The above inner query is aliased as table sq.

Now if I join my_table with the above query as follows:

SELECT my_table.* from my_table join (
  SELECT email, MAX(id) AS max_id, COUNT(*) AS cnt FROM my_table
    GROUP BY email
    HAVING cnt > 1
) sq on my_table.email = sq.email and my_table.id <> sq.max_id

I get:

| id  | email            |
| --- | ---------------- |
| 1   | [email protected] |
| 2   | [email protected] |
| 4   | [email protected] |

because I am selecting from my_table all rows that have duplicate email addresses (condition my_table.email = sq.email except for the rows that have the highest value of id for each email address (condition my_table.id <> sq.max_id).

It is the ids from the above join whose email addresses are to be modified.

Upvotes: 2

Related Questions