Reputation: 860
I have a huge database that I have almost over 10k row in my user table and there are 2700 duplicate email addresses.
Basically the application did not limit the users from registering their accounts with the same email address over and over again. I have cleaned the multiple ones -more than 2 times- manually, there weren't many, but there are 2700 email addresses with duplicate value occur at least 2 times. So I want to update the duplicate email addresses and change the email address with a smaller id number to something like from "[email protected]" to "[email protected]", basically adding "1" to the beginning of all duplicate email addresses. I can select and display the duplicate email addresses but could not find the way to update only one of the email addresses and leave the other on untouched.
My table structure is like id
username
email
password
.
Upvotes: 0
Views: 1224
Reputation: 42728
WITH cte AS ( SELECT id,
email,
ROW_NUMBER() OVER (PARTITION BY email ORDER BY id) rn
FROM sourcetable )
UPDATE sourcetable src, cte
SET src.email = CONCAT(rn - 1, src.email)
WHERE src.id = cte.id
AND cte.rn > 1;
I want to update the duplicate email addresses and change the email address with a smaller id number
If so the ordering in window function must be reversed:
WITH cte AS ( SELECT id,
email,
ROW_NUMBER() OVER (PARTITION BY email ORDER BY id DESC) rn
FROM sourcetable )
UPDATE sourcetable src, cte
SET src.email = CONCAT(rn - 1, src.email)
WHERE src.id = cte.id
AND cte.rn > 1;
Upvotes: 2
Reputation: 44223
If you do not have MySQL 8:
Here I am just prepending the id
of the row to the email address:
UPDATE my_table JOIN (
SELECT email, MAX(id) AS max_id, COUNT(*) AS cnt FROM my_table
GROUP BY email
HAVING cnt > 1
) sq ON my_table.email = sq.email AND my_table.id <> sq.max_id
SET my_table.email = CONCAT( my_table.id, my_table.email)
;
The inner query:
SELECT email, MAX(id) AS max_id, COUNT(*) AS cnt FROM my_table
GROUP BY email
HAVING cnt > 1
looks for all emails that that are duplicated (i.e. there is more than one row with the same email address) and computes the row that has the maximum id
value for each email address. For the sample data in my DB-Fiddle demo, it would return the following:
| email | max_id | cnt |
| ---------------- | ------ | --- |
| [email protected] | 3 | 3 |
| [email protected] | 5 | 2 |
The above inner query is aliased as table sq
.
Now if I join my_table
with the above query as follows:
SELECT my_table.* from my_table join (
SELECT email, MAX(id) AS max_id, COUNT(*) AS cnt FROM my_table
GROUP BY email
HAVING cnt > 1
) sq on my_table.email = sq.email and my_table.id <> sq.max_id
I get:
| id | email |
| --- | ---------------- |
| 1 | [email protected] |
| 2 | [email protected] |
| 4 | [email protected] |
because I am selecting from my_table
all rows that have duplicate email addresses (condition my_table.email = sq.email
except for the rows that have the highest value of id
for each email address (condition my_table.id <> sq.max_id
).
It is the ids from the above join whose email addresses are to be modified.
Upvotes: 2