supermario
supermario

Reputation: 2765

Mysql delete duplicate comments?

I want to clean up duplicated content on a comments table (1M rows) where users have posted twice (or more) the same comments. However I want to keep one instance of any repeated comment.

Here is the query that I came up with which finds and groups those comments:

SELECT author, body, COUNT(*) as count
FROM  db.comment
GROUP BY body
HAVING COUNT(*) > 1;

But don't know how to remove the repeated rows while leaving only one untouched. I have seen similar questions but none worked for me. So appreciate your hints.

Update:

mysql> describe comment;
+---------+-------------+------+-----+---------+----------------+
| Field   | Type        | Null | Key | Default | Extra          |
+---------+-------------+------+-----+---------+----------------+
| id      | int(11)     | NO   | PRI | NULL    | auto_increment |
| created | datetime    | NO   |     | NULL    |                |
| author  | varchar(60) | NO   |     | NULL    |                |
| body    | longtext    | NO   |     | NULL    |                |
| post_id | int(11)     | NO   | MUL | NULL    |                |
+---------+-------------+------+-----+---------+----------------+

Upvotes: 1

Views: 186

Answers (2)

Paul
Paul

Reputation: 26690

Unlike other DBMS, MySQL has a possibility to select all fields from table but do a grouping by only one of them. In this case only the first record of each group will be selected.

Do the job in two steps:

save IDs to remain in a temporary table:

INSERT INTO temp_comment(id)
SELECT id
FROM db.comment
GROUP BY author, body

delete all rows except the saved ones:

DELETE FROM db.comment WHERE id NOT IN (SELECT id FROM temp_comment);

Of course you need the temp_comment table to exist.

Upvotes: 1

Hotdin Gurning
Hotdin Gurning

Reputation: 1819

Is this what you want?

SELECT * FROM comments WHERE id NOT IN (
  SELECT id
  FROM  comments
  GROUP BY author,body
  HAVING COUNT(*) > 1
 )
AND author IN(
  SELECT author
  FROM  comments
  GROUP BY author,body
  HAVING COUNT(*) > 1
  )
AND body IN(
  SELECT body
  FROM  comments
  GROUP BY author,body
  HAVING COUNT(*) > 1
  );

To delete the duplicate rows, change SELECT * to DELETE

SQL Fiddle Demo

Update

To improve query performance, you can try this :

SELECT * FROM comments c
INNER JOIN 
(
  SELECT id,author,body
  FROM  comments
  GROUP BY author,body
  HAVING COUNT(*) > 1
 ) AS t
ON c.id NOT IN(t.id) AND c.author IN(t.author) AND c.body IN(t.body)

Upvotes: 1

Related Questions