Reputation: 767
the goal of the query is also to find possible duplicates of names that were mistyped. Example:
International Group Inc.
must be find as a duplicate of International, Group Inc
In order to accomplish this a used the next query:
SELECT C.id,
C.name,
C.address,
C.city_id
FROM company C
INNER JOIN (SELECT name
FROM company
GROUP BY name
HAVING Count(id) > 1) D
ON Replace(Replace(C.name, '.', ''), ',', '') =
Replace(Replace(D.name, '.', ''), ',', '')
It works very well and the result came at 40 secs
but adding an extra condition like AND C.city_id='4'
requires an extra minute or more; This is still acceptable but not preferable.
My real problem occurs when I try to add another condition to find out only duplicates of companies that have a specific string in the name, using this condition AND C.name LIKE '%International%'
, this just don't return any results.
Could somebody help me figure out what I am doing wrong?
Thanks
Upvotes: 5
Views: 1370
Reputation: 29759
Because you are joining on the result of a function, the query cannot use any index. Besides, the cost of executing the REPLACE()
on all rows is probably not negligible.
I suggest you first add an indexed column that receives the "stripped-down" version of the strings, and then run the query with a join on this column:
ALTER TABLE company ADD COLUMN stripped_name VARCHAR(50);
ALTER TABLE company ADD INDEX(stripped_name);
UPDATE TABLE company SET stripped_name = REPLACE(REPLACE(name, '.', ''), ',', '') ;
Running the UPDATE
could take a while the first time, but you could also set an ON UPDATE
and an ON INSERT
triggers on company
so that stripped_name
gets populated and update on-the-fly.
Upvotes: 6
Reputation: 8090
Try starting from the tmp table because foreach row in company a tmp table will be created :
SELECT C.id,
C.name,
C.address,
C.city_id
FROM (SELECT name
FROM company
GROUP BY name
HAVING Count(id) > 1) D
INNER JOIN company C
ON Replace(Replace(C.name, '.', ''), ',', '') =
Replace(Replace(D.name, '.', ''), ',', '')
Upvotes: 0