Reputation: 738
i have a table eng-jap
which is essentially just a translation so having an english and a japanese column. a script i made somehow cause every insert to have a clone and thus 1000s of duplicate entries in this table, for example:
duplicate example A
eng jap
"mother washes every day" "母は毎日洗濯する"
"mother washes every day" "母は毎日洗濯する"
if it were just one column i could use the query:
SELECT eng, COUNT(*) c FROM `eng-jap` GROUP BY eng HAVING c > 1
but since the table can legitimately have a duplicates in eng or jap, as long as its not in both. for example:
duplicate example B
eng jap
"mother washes every day" "母は毎日洗濯する"
"every day mother washes" "母は毎日洗濯する"
this is to allow one sentence to have more than one translation. so i need to alter the query to find duplicates as a combination of both columns i guess you could say.
once again to be clear. example B is fine, i want to select all duplicates like example A so i can make a scrip to remove one of all of the duplicates. please and Thank you!
Upvotes: 2
Views: 2055
Reputation: 73293
I'm not sure if you have already got answers here. I'm also not sure if you're trying to read only duplicated rows, or you want a result set with no duplicates. It all depends on what the definition of a duplicate is for you.
1) Assuming you need only duplicated rows (as your title says):
a) If in example A there is a duplicate and in example B there are no duplicates, then just:
SELECT eng, COUNT(*) c FROM `eng-jap` GROUP BY eng, jap HAVING c > 1
b) If you find a duplicate even in example B (since jap part is the same), then:
SELECT *
FROM (
SELECT *
FROM (
SELECT *
FROM `eng-jap`
GROUP BY eng
HAVING COUNT(*) > 1
UNION ALL
SELECT *
FROM `eng-jap`
GROUP BY jap
HAVING COUNT(*) > 1
) AS t
GROUP BY eng
) AS v
GROUP BY jap
2) If you're looking at records with no duplicate rows (since you say at the last i want to select all duplicates like example A so i can make a scrip to remove one of all of the duplicates
):
a) If in example A there is a duplicate and in example B there are no duplicates, then just:
SELECT eng, COUNT(*) c FROM `eng-jap` GROUP BY eng, jap
b) If you find a duplicate even in example B (since jap part is the same), then:
SELECT *
FROM (SELECT c.eng, c.jap FROM `eng-jap` c GROUP BY c.eng) t
GROUP BY t.jap
Upvotes: 0
Reputation: 49089
I think you just need to group by eng and jap:
SELECT eng, jap, COUNT(*) c FROM `eng-jap` GROUP BY eng, jap HAVING c > 1
And if you want to remove all duplicates, if your rows have an id
, this query shows all the ids that you have to keep:
select
SUBSTRING_INDEX(GROUP_CONCAT(CAST(id AS CHAR) order by id), ',', 1) as id
from `eng-jap`
group by eng, jap
(it's a trick that uses GROUP_CONCAT
to find the first id of every combination of eng/jap
). And this query shows the ids of the rows you have to delete:
select id
from
`eng-jap`
left join
(select
SUBSTRING_INDEX(GROUP_CONCAT(CAST(id AS CHAR) order by id), ',', 1) as id
from `eng-jap`
group by eng, jap) `eng-jap-dup`
on `eng-jap`.id = `eng-jap-dup`.id
where `eng-jap-dup`.id is null
I rewrote this query using just join, it has to be a little faster, but if your table is too big it is probably still slow.
If it is still too slow and it still doesn't work, i would suggest you to add two more columns to your table:
eng-hash
, where you can save MD5(eng)
jap-hash
, where you can save MD5(jap)
then update all of your records like this:
update `eng-jap` set `eng-jap`.`eng-hash` = MD5(eng), `eng-jap`.`jap-hash` = MD5(jap)
then you can add a unique index on the table on both columns, ignore all errors, and let MySql do the work to eliminate duplicates for you:
alter ignore table `eng-jap` add unique index (eng-hash, jap-hash);
(if you get an error while creating index, see this question: MySQL: ALTER IGNORE TABLE gives "Integrity constraint violation")
Upvotes: 2
Reputation: 42984
You could temporarily add a column where you simply concatenate the value of both current columns. Then you make your query grouping by that column and eliminate matching rows. Afterwards you drop the column again.
The same can be done inside the query itself of course, just group by a synthesized result column that contains the concatenation of both physival columns.
Upvotes: 1