Reputation: 4338
What's the best way to delete duplicate records in a mysql database using rails or mysql queries?
Upvotes: 13
Views: 36509
Reputation: 3290
Firstly do group by column on which you want to delete duplicate.But I am not doing it with group by.I am writing self join.
You don't need to create the temporary table.
Delete duplicate except one record: In this table it should have auto increment column. The possible solution that I've just come across:
DELETE n1 FROM names n1, names n2 WHERE n1.id > n2.id AND n1.name = n2.name
if you want to keep the row with the lowest auto increment id value OR
DELETE n1 FROM names n1, names n2 WHERE n1.id < n2.id AND n1.name = n2.name
if you want to keep the row with the highest auto increment id value.
You can cross check your solution, find duplicate again:
SELECT * FROM `names` GROUP BY name, id having count(name) > 1;
If it return 0 result, then you query is successful.
Upvotes: 0
Reputation: 558
Here is the rails solution I came up with. May not be the most efficient, but not a big deal if its a one time migration.
distinct_records = MyTable.all.group(:distinct_column_1, :distinct_column_2).map {|mt| mt.id}
duplicates = MyTable.all.to_a.reject!{|mt| distinct_records.include? mt.id}
duplicates.each(&:destroy)
First, groups by all columns that determine uniqueness, the example shows 2 but you could have more or less
Second, selects the inverse of that group...all other records
Third, Deletes all those records.
Upvotes: 0
Reputation: 14051
I used @krukid's answer above to do the following on a table with around 70,000 entries:
rs = 'select a, b, count(*) as c from table group by 1, 2 having c > 1'
# get a hashmap
dups = MyModel.connection.select_all(rs)
# convert to array
dupsarr = dups.map { |i| [i.a, i.b, i.c] }
# delete dups
dupsarr.each do |a,b,c|
ActiveRecord::Base.connection.execute("delete from table_name where a=#{MyModel.sanitize(a)} and b=#{MyModel.sanitize(b)} limit #{c-1}")
end
Upvotes: 0
Reputation: 11
suppose we have a table name tbl_product and there is duplicacy in the field p_pi_code and p_nats_id in maximum no of count then
first create a new table insert the data from existing table ...
ie from tbl_product to newtable1 if anything else then newtable1 to newtable2
CREATE TABLE `newtable2` (
`p_id` int(10) unsigned NOT NULL auto_increment,
`p_status` varchar(45) NOT NULL,
`p_pi_code` varchar(45) NOT NULL,
`p_nats_id` mediumint(8) unsigned NOT NULL,
`p_is_special` tinyint(4) NOT NULL,
PRIMARY KEY (`p_id`)
) ENGINE=InnoDB;
INSERT INTO newtable1 (p_status, p_pi_code, p_nats_id, p_is_special) SELECT
p_status, p_pi_code, p_nats_id, p_is_special FROM tbl_product group by p_pi_code;
INSERT INTO newtable2 (p_status, p_pi_code, p_nats_id, p_is_special) SELECT
p_status, p_pi_code, p_nats_id, p_is_special FROM newtable1 group by p_nats_id;
after that we see all the duplicacy in the field is removed
Upvotes: 1
Reputation: 17
I am using Alter Table
ALTER IGNORE TABLE jos_city ADD UNIQUE INDEX(`city`);
Upvotes: 0
Reputation: 11
If you have PK (id) in table (EMP) and want to older delete duplicate records with name column. For large data following query may be good approach.
DELETE t3
FROM (
SELECT t1.name, t1.id
FROM (
SELECT name
FROM EMP
GROUP BY name
HAVING COUNT(name) > 1
) AS t0 INNER JOIN EMP t1 ON t0.name = t1.name
) AS t2 INNER JOIN EMP t3 ON t3.name = t2.name
WHERE t2.id < t3.id;
Upvotes: 1
Reputation: 10353
SELECT DISTINCT(req_field) AS field, COUNT(req_field) AS fieldCount FROM
table_name GROUP BY req_field HAVING fieldCount > 1
DELETE FROM table_name
USING table_name, table_name AS vtable
WHERE
(table_name.id > vtable.id)
AND (table_name.req_field=req_field)
Replace req_field and table_name - should work without any issues.
Upvotes: 6
Reputation: 109
In MySql when I put something like
delete from A where IDA in (select IDA from A )
mySql said something like "you can't use the same table in the select part of the delete operation."
I've just have to delete some duplicate records, and I have succeeded with a .php program like that
<?php
...
$res = hacer_sql("SELECT MIN(IDESTUDIANTE) as IDTODELETE
FROM `estudiante` group by `LASTNAME`,`FIRSTNAME`,`CI`,`PHONE`
HAVING COUNT(*) > 1 )");
while ( $reg = mysql_fetch_assoc($res) ) {
hacer_sql("delete from estudiante where IDESTUDIANTE = {$reg['IDTODELETE']}");
}
?>
Upvotes: 0
Reputation: 4525
Here's another idea in no particular language:
rs = `select a, b, count(*) as c from entries group by 1, 2 having c > 1`
rs.each do |a, b, c|
`delete from entries where a=#{a} and b=#{b} limit #{c - 1}`
end
Edit:
Kudos to Olaf for that "having" hint :)
Upvotes: 8
Reputation: 1395
If your table has a PK (or you can easily give it one), you can specify any number of columns in the table to be equal (to qualify is as a duplicate) with the following query (may be a bit messy looking but it works):
DELETE FROM table WHERE pk_id IN(
SELECT DISTINCT t3.pk_id FROM (
SELECT t1.* FROM table AS t1 INNER JOIN (
SELECT col1, col2, col3, col4, COUNT(*) FROM table
GROUP BY col1, col2, col3, col4 HAVING COUNT(*)>1) AS t2
ON t1.col1 = t2.col1 AND t1.col2 = t2.col2 AND t1.col3 = t2.col3 AND
t1.col4 = t2.col4)
AS t3, (
SELECT t1.* FROM table AS t1 INNER JOIN (
SELECT col1, col2, col3, col4, COUNT(*) FROM table
GROUP BY col1, col2, col3, col4 HAVING COUNT(*)>1) AS t2
ON t1.col1 = t2.col1 AND t1.col2 = t2.col2 AND t1.col3 = t2.col3 AND
t1.col4 = t2.col4)
AS t4
WHERE t3.col1 = t4.col1 AND t3.pk_id > t4.pk_id
)
This will leave the first record entered into the database, deleting the 'newest' duplicates. If you want to keep the last record, switch the > to <.
Upvotes: 0
Reputation:
New to SQL :-) This is a classic question - often asked in interviews:-) I don't know whether it'll work in MYSQL but it works in most databases -
> create table t(
> a char(2),
> b char(2),
> c smallint )
> select a,b,c,count(*) from t
> group by a,b,c
> having count(*) > 1
a b c
-- -- ------ -----------
(0 rows affected)
> insert into t values ("aa","bb",1)
(1 row affected)
> insert into t values ("aa","bb",1)
(1 row affected)
> insert into t values ("aa","bc",1)
(1 row affected)
> select a,b,c,count(*) from t group by a,b,c having count(*) > 1
a b c
-- -- ------ -----------
aa bb 1 2
(1 row affected)
Upvotes: 4
Reputation: 10845
You can use:
http://lenniedevilliers.blogspot.com/2008/10/weekly-code-find-duplicates-in-sql.html
to get the duplicates and then just delete them via Ruby code or SQL code (I would do it in SQL code but thats up to you :-)
Upvotes: 0
Reputation: 2035
well, if it's a small table, from rails console you can do
class ActiveRecord::Base
def non_id_attributes
atts = self.attributes
atts.delete('id')
atts
end
end
duplicate_groups = YourClass.find(:all).group_by { |element| element.non_id_attributes }.select{ |gr| gr.last.size > 1 }
redundant_elements = duplicate_groups.map { |group| group.last - [group.last.first] }.flatten
redundant_elements.each(&:destroy)
Upvotes: 7
Reputation: 18484
I had to do this recently on Oracle, but the steps would have been the same on MySQL. It was a lot of data, at least compared to what I'm used to working with, so my process to de-dup was comparatively heavyweight. I'm including it here in case someone else comes along with a similar problem.
My duplicate records had different IDs, different updated_at
times, possibly different updated_by
IDs, but all other columns the same. I wanted to keep the most recently updated of any duplicate set.
I used a combination of Rails logic and SQL to get it done.
Step one: run a rake script to identify the IDs of the duplicate records, using model logic. IDs go in a text file.
Step two: create a temporary table with one column, the IDs to delete, loaded from the text file.
Step three: create another temporary table with all the records I'm going to delete (just in case!).
CREATE TABLE temp_duplicate_models
AS (SELECT * FROM models
WHERE id IN (SELECT * FROM temp_duplicate_ids));
Step four: actual deleting.
DELETE FROM models WHERE id IN (SELECT * FROM temp_duplicate_ids);
Upvotes: 0
Reputation: 30364
What you can do is copy the distinct records into a new table by:
select distinct * into NewTable from MyTable
Upvotes: 10