Reputation:
I have a table with some ids + titles. I want to make the title column unique, but it has over 600k records already, some of which are duplicates (sometimes several dozen times over).
How do I remove all duplicates, except one, so I can add a UNIQUE key to the title column after?
Upvotes: 38
Views: 34015
Reputation: 377
Here's a method to remove duplicate entries from mysql db while keeping the ones without duplicates and also keeping the lowest id of the entries with duplicates:
DELETE FROM the_table
WHERE id not IN (
SELECT min(id)
FROM the_table
GROUP BY titles
);
Upvotes: 0
Reputation: 866
Deleting duplicates on MySQL tables is a common issue, that usually comes with specific needs. In case anyone is interested, here (Remove duplicate rows in MySQL) I explain how to use a temporary table to delete MySQL duplicates in a reliable and fast way (with examples for different use cases).
In this case, something like this should work:
-- create a new temporary table
CREATE TABLE tmp_table1 LIKE table1;
-- add a unique constraint
ALTER TABLE tmp_table1 ADD UNIQUE(id, title);
-- scan over the table to insert entries
INSERT IGNORE INTO tmp_table1 SELECT * FROM table1 ORDER BY sid;
-- rename tables
RENAME TABLE table1 TO backup_table1, tmp_table1 TO table1;
Upvotes: 0
Reputation: 2490
Below query can be used to delete all the duplicate except the one row with lowest "id" field value
DELETE t1 FROM table_name t1, table_name t2 WHERE t1.id > t2.id AND t1.name = t2.name
In the similar way, we can keep the row with the highest value in 'id' as follows
DELETE t1 FROM table_name t1, table_name t2 WHERE t1.id < t2.id AND t1.name = t2.name
Upvotes: 1
Reputation: 25504
Since the MySql ALTER IGNORE TABLE
has been deprecated, you need to actually delete the duplicate date before adding an index.
First write a query that finds all the duplicates. Here I'm assuming that email
is the field that contains duplicates.
SELECT
s1.email
s1.id,
s1.created
s2.id,
s2.created
FROM
student AS s1
INNER JOIN
student AS s2
WHERE
/* Emails are the same */
s1.email = s2.email AND
/* DON'T select both accounts,
only select the one created later.
The serial id could also be used here */
s2.created > s1.created
;
Next select only the unique duplicate ids:
SELECT
DISTINCT s2.id
FROM
student AS s1
INNER JOIN
student AS s2
WHERE
s1.email = s2.email AND
s2.created > s1.created
;
Once you are sure that only contains the duplicate ids you want to delete, run the delete. You have to add (SELECT * FROM tblname)
so that MySql doesn't complain.
DELETE FROM
student
WHERE
id
IN (
SELECT
DISTINCT s2.id
FROM
(SELECT * FROM student) AS s1
INNER JOIN
(SELECT * FROM student) AS s2
WHERE
s1.email = s2.email AND
s2.created > s1.created
);
Then create the unique index:
ALTER TABLE
student
ADD UNIQUE INDEX
idx_student_unique_email(email)
;
Upvotes: 1
Reputation: 16220
Create a new table with just the distinct rows of the original table. There may be other ways but I find this the cleanest.
CREATE TABLE tmp_table AS SELECT DISTINCT [....] FROM main_table
More specifically:
The faster way is to insert distinct rows into a temporary table. Using delete, it took me a few hours to remove duplicates from a table of 8 million rows. Using insert and distinct, it took just 13 minutes.
CREATE TABLE tempTableName LIKE tableName;
CREATE INDEX ix_all_id ON tableName(cellId,attributeId,entityRowId,value);
INSERT INTO tempTableName(cellId,attributeId,entityRowId,value) SELECT DISTINCT cellId,attributeId,entityRowId,value FROM tableName;
DROP TABLE tableName;
INSERT tableName SELECT * FROM tempTableName;
DROP TABLE tempTableName;
Upvotes: 12
Reputation: 16
The solution posted by Nitin seems to be the most elegant / logical one.
However it has one issue:
ERROR 1093 (HY000): You can't specify target table 'student' for update in FROM clause
This can however be resolved by using (SELECT * FROM student) instead of student:
DELETE FROM student WHERE id IN (
SELECT distinct(s1.`student_id`) FROM (SELECT * FROM student) AS s1 INNER JOIN (SELECT * FROM student) AS s2
WHERE s1.`sex` = s2.`sex` AND
s1.`student_id` > s2.`student_id` AND
s1.`sex` = 'M'
ORDER BY `s1`.`student_id` ASC
)
Give your +1's to Nitin for coming up with the original solution.
Upvotes: 0
Reputation: 49
delete from student where id in (
SELECT distinct(s1.`student_id`) from student as s1 inner join student as s2
where s1.`sex` = s2.`sex` and
s1.`student_id` > s2.`student_id` and
s1.`sex` = 'M'
ORDER BY `s1`.`student_id` ASC
)
Upvotes: 0
Reputation: 879113
This command adds a unique key, and drops all rows that generate errors (due to the unique key). This removes duplicates.
ALTER IGNORE TABLE table ADD UNIQUE KEY idx1(title);
Edit: Note that this command may not work for InnoDB tables for some versions of MySQL. See this post for a workaround. (Thanks to "an anonymous user" for this information.)
Upvotes: 81
Reputation: 181
This shows how to do it in SQL2000. I'm not completely familiar with MySQL syntax but I'm sure there's something comparable
create table #titles (iid int identity (1, 1), title varchar(200))
-- Repeat this step many times to create duplicates
insert into #titles(title) values ('bob')
insert into #titles(title) values ('bob1')
insert into #titles(title) values ('bob2')
insert into #titles(title) values ('bob3')
insert into #titles(title) values ('bob4')
DELETE T FROM
#titles T left join
(
select title, min(iid) as minid from #titles group by title
) D on T.title = D.title and T.iid = D.minid
WHERE D.minid is null
Select * FROM #titles
Upvotes: 0