Reputation: 3093
I have a database like this:
users
id name email phone
1 bill [email protected]
2 bill [email protected] 123456789
3 susan [email protected]
4 john [email protected] 123456789
5 john [email protected] 987654321
I want to merge records considered duplicates based on the email field.
Trying to figure out how to use the following considerations.
with the highest id number (see the [email protected] row for an example.)
Here is a query I tried:
DELETE FROM users WHERE users.id NOT IN
(SELECT grouped.id FROM (SELECT DISTINCT ON (email) * FROM users) AS grouped)
Getting a syntax error.
I'm trying to get the database to transform to this, I can't figure out the correct query:
users
id name email phone
2 bill [email protected] 123456789
3 susan [email protected]
5 john [email protected] 987654321
Upvotes: 4
Views: 1748
Reputation: 521289
Here is one option using a delete join:
DELETE
FROM users
WHERE id NOT IN (SELECT id
FROM (
SELECT CASE WHEN COUNT(*) = 1
THEN MAX(id)
ELSE MAX(CASE WHEN phone IS NOT NULL THEN id END) END AS id
FROM users
GROUP BY email) t);
The logic of this delete is as follows:
id
value, where the phone is also defined.Upvotes: 3
Reputation: 147166
Here's a solution that will give you the latest data for each field for each user in the result table, thus meeting your second criterion as well as the first and third. It will work for as many duplicates as you have, subject to the group_concat_max_len
condition on GROUP_CONCAT
. It uses GROUP_CONCAT
to prepare a list of all values of a field for each user, sorted so that the most recent value is first. SUBSTRING_INDEX
is then used to extract the first value in that list, which is the most recent. This solution uses a CREATE TABLE ... SELECT
command to make a new users
table, then DROP
s the old one and renames the new table to users
.
CREATE TABLE users
(`id` int, `name` varchar(5), `email` varchar(19), `phone` int)
;
INSERT INTO users
(`id`, `name`, `email`, `phone`)
VALUES
(1, 'bill', '[email protected]', 123456789),
(2, 'bill', '[email protected]', NULL),
(3, 'susan', '[email protected]', NULL),
(4, 'john', '[email protected]', 123456789),
(5, 'john', '[email protected]', 987654321)
;
CREATE TABLE newusers AS
SELECT id
, SUBSTRING_INDEX(names, ',', 1) AS name
, email
, SUBSTRING_INDEX(phones, ',', 1) AS phone
FROM (SELECT id
, GROUP_CONCAT(name ORDER BY id DESC) AS names
, email
, GROUP_CONCAT(phone ORDER BY id DESC) AS phones
FROM users
GROUP BY email) u;
DROP TABLE users;
RENAME TABLE newusers TO users;
SELECT * FROM users
Output:
id name email phone
1 bill [email protected] 123456789
4 john [email protected] 987654321
3 susan [email protected] (null)
Upvotes: 1