user1397417
user1397417

Reputation: 738

how do i find all duplicate entries in mysql when its two columns?

i have a table eng-jap which is essentially just a translation so having an english and a japanese column. a script i made somehow cause every insert to have a clone and thus 1000s of duplicate entries in this table, for example:

duplicate example A

eng                        jap
"mother washes every day"  "母は毎日洗濯する"
"mother washes every day"  "母は毎日洗濯する"

if it were just one column i could use the query:

SELECT eng, COUNT(*) c FROM `eng-jap` GROUP BY eng HAVING c > 1

but since the table can legitimately have a duplicates in eng or jap, as long as its not in both. for example:

duplicate example B

eng                        jap
"mother washes every day"  "母は毎日洗濯する"
"every day mother washes"  "母は毎日洗濯する"

this is to allow one sentence to have more than one translation. so i need to alter the query to find duplicates as a combination of both columns i guess you could say.

once again to be clear. example B is fine, i want to select all duplicates like example A so i can make a scrip to remove one of all of the duplicates. please and Thank you!

Upvotes: 2

Views: 2055

Answers (3)

nawfal
nawfal

Reputation: 73293

I'm not sure if you have already got answers here. I'm also not sure if you're trying to read only duplicated rows, or you want a result set with no duplicates. It all depends on what the definition of a duplicate is for you.

1) Assuming you need only duplicated rows (as your title says):

a) If in example A there is a duplicate and in example B there are no duplicates, then just:

SELECT eng, COUNT(*) c FROM `eng-jap` GROUP BY eng, jap HAVING c > 1

b) If you find a duplicate even in example B (since jap part is the same), then:

SELECT   * 
FROM     (
          SELECT   * 
          FROM     (
                    SELECT   * 
                    FROM     `eng-jap` 
                    GROUP BY eng 
                    HAVING   COUNT(*) > 1

                    UNION ALL

                    SELECT   * 
                    FROM     `eng-jap` 
                    GROUP BY jap 
                    HAVING   COUNT(*) > 1
                   ) AS t
          GROUP BY eng
             ) AS v
GROUP BY jap

2) If you're looking at records with no duplicate rows (since you say at the last i want to select all duplicates like example A so i can make a scrip to remove one of all of the duplicates):

a) If in example A there is a duplicate and in example B there are no duplicates, then just:

SELECT eng, COUNT(*) c FROM `eng-jap` GROUP BY eng, jap

b) If you find a duplicate even in example B (since jap part is the same), then:

SELECT   * 
FROM     (SELECT c.eng, c.jap FROM `eng-jap` c GROUP BY c.eng) t 
GROUP BY t.jap

Upvotes: 0

fthiella
fthiella

Reputation: 49089

I think you just need to group by eng and jap:

SELECT eng, jap, COUNT(*) c FROM `eng-jap` GROUP BY eng, jap HAVING c > 1

And if you want to remove all duplicates, if your rows have an id, this query shows all the ids that you have to keep:

select
  SUBSTRING_INDEX(GROUP_CONCAT(CAST(id AS CHAR) order by id), ',', 1) as id
from `eng-jap`
group by eng, jap

(it's a trick that uses GROUP_CONCAT to find the first id of every combination of eng/jap). And this query shows the ids of the rows you have to delete:

select id
from
  `eng-jap`
     left join
  (select
     SUBSTRING_INDEX(GROUP_CONCAT(CAST(id AS CHAR) order by id), ',', 1) as id
     from `eng-jap`
     group by eng, jap) `eng-jap-dup`
  on `eng-jap`.id = `eng-jap-dup`.id
where `eng-jap-dup`.id is null

I rewrote this query using just join, it has to be a little faster, but if your table is too big it is probably still slow.

If it is still too slow and it still doesn't work, i would suggest you to add two more columns to your table:

  • eng-hash, where you can save MD5(eng)
  • jap-hash, where you can save MD5(jap)

then update all of your records like this:

update `eng-jap` set `eng-jap`.`eng-hash` = MD5(eng), `eng-jap`.`jap-hash` = MD5(jap)

then you can add a unique index on the table on both columns, ignore all errors, and let MySql do the work to eliminate duplicates for you:

alter ignore table `eng-jap` add unique index (eng-hash, jap-hash);

(if you get an error while creating index, see this question: MySQL: ALTER IGNORE TABLE gives "Integrity constraint violation")

Upvotes: 2

arkascha
arkascha

Reputation: 42984

You could temporarily add a column where you simply concatenate the value of both current columns. Then you make your query grouping by that column and eliminate matching rows. Afterwards you drop the column again.

The same can be done inside the query itself of course, just group by a synthesized result column that contains the concatenation of both physival columns.

Upvotes: 1

Related Questions