Reputation: 2765
I want to clean up duplicated content on a comments table (1M rows) where users have posted twice (or more) the same comments. However I want to keep one instance of any repeated comment.
Here is the query that I came up with which finds and groups those comments:
SELECT author, body, COUNT(*) as count
FROM db.comment
GROUP BY body
HAVING COUNT(*) > 1;
But don't know how to remove the repeated rows while leaving only one untouched. I have seen similar questions but none worked for me. So appreciate your hints.
Update:
mysql> describe comment;
+---------+-------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+---------+-------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| created | datetime | NO | | NULL | |
| author | varchar(60) | NO | | NULL | |
| body | longtext | NO | | NULL | |
| post_id | int(11) | NO | MUL | NULL | |
+---------+-------------+------+-----+---------+----------------+
Upvotes: 1
Views: 186
Reputation: 26690
Unlike other DBMS, MySQL has a possibility to select all fields from table but do a grouping by only one of them. In this case only the first record of each group will be selected.
Do the job in two steps:
save IDs to remain in a temporary table:
INSERT INTO temp_comment(id)
SELECT id
FROM db.comment
GROUP BY author, body
delete all rows except the saved ones:
DELETE FROM db.comment WHERE id NOT IN (SELECT id FROM temp_comment);
Of course you need the temp_comment
table to exist.
Upvotes: 1
Reputation: 1819
Is this what you want?
SELECT * FROM comments WHERE id NOT IN (
SELECT id
FROM comments
GROUP BY author,body
HAVING COUNT(*) > 1
)
AND author IN(
SELECT author
FROM comments
GROUP BY author,body
HAVING COUNT(*) > 1
)
AND body IN(
SELECT body
FROM comments
GROUP BY author,body
HAVING COUNT(*) > 1
);
To delete
the duplicate rows, change SELECT *
to DELETE
Update
To improve query performance, you can try this :
SELECT * FROM comments c
INNER JOIN
(
SELECT id,author,body
FROM comments
GROUP BY author,body
HAVING COUNT(*) > 1
) AS t
ON c.id NOT IN(t.id) AND c.author IN(t.author) AND c.body IN(t.body)
Upvotes: 1