Reputation: 3325
I'm cleaning up a dirty database I inherited and need to "fuzzy match" names for human review. I've come up with a solution that works, but it is terribly slow: 7 minutes on 15k rows. I have the feeling I'm overlooking some really simple solution.
Example records:
1 John Smith
2 John Q Smith
3 Janway Smith
4 Jane Chen
5 David Jones
6 Natalia La Brody
7 Natalia LaBrody
8 LaBrody
9 Dave Jones
I need multiple criteria for this fuzzy match. Two I've come up with (both reflected in the code below) are:
- If a name contains no space, treat it as a bare surname and match it against the last word of every other name (so "LaBrody" matches "Natalia LaBrody").
- Otherwise, match two names when their first three and last five characters agree (so "Natalia La Brody" matches "Natalia LaBrody").
My code looks like this:
UPDATE authors a
INNER JOIN (SELECT id, author_name FROM authors) b
    ON CASE WHEN a.author_name NOT REGEXP ' '
            -- bare surname: match it against the last word of the other name
            THEN a.author_name = substring_index(b.author_name, ' ', -1)
            -- otherwise: match on first three + last five characters
            ELSE concat(LEFT(a.author_name, 3), RIGHT(a.author_name, 5)) =
                 concat(LEFT(b.author_name, 3), RIGHT(b.author_name, 5))
       END
SET tags = concat_ws(',', tags, 'Duplicate?')
WHERE a.id <> b.id
I was surprised I could put a CASE in an ON clause but it worked. Still, how can I do this with substantially better performance?
Upvotes: 0
Views: 2015
Reputation: 15464
One approach is to use SOUNDEX. You cannot rely on it 100%, but it helps you narrow down the candidates and makes the query fast.
select t, soundex(t) from
(
    select 'John Smith' as t
    union select 'John Q Smith'
    union select 'Janway Smith'
    union select 'Jane Chen'
    union select 'David Jones'
    union select 'Natalia La Brody'
    union select 'Natalia LaBrody'
    union select 'LaBrody'
    union select 'dave jones'
) tbl
group by soundex(t) -- keeps one row per code; needs ONLY_FULL_GROUP_BY disabled
Output (note that 'John Q Smith' and 'Janway Smith' are missing: they share the code J5253 with 'John Smith', so the GROUP BY keeps only one row per code; also note that 'David Jones' and 'dave jones' get different codes, which is why you cannot rely on it 100%):
'Natalia La Brody', 'N34163'
'LaBrody', 'L163'
'John Smith', 'J5253'
'Jane Chen', 'J525'
'David Jones', 'D13252'
'dave jones', 'D1252'
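To apply this to the question's table, you can precompute the code once per row and index it, then self-join on the indexed column instead of comparing every pair of rows. A minimal sketch, assuming MySQL 5.7+ for the generated column (the name_soundex and idx_authors_soundex names are illustrative):
-- precompute soundex once per row so the self-join can use an index
ALTER TABLE authors
    ADD COLUMN name_soundex VARCHAR(32)
        GENERATED ALWAYS AS (SOUNDEX(author_name)) STORED;
CREATE INDEX idx_authors_soundex ON authors (name_soundex);

-- tag every row whose code collides with at least one other row
UPDATE authors a
INNER JOIN authors b
    ON b.name_soundex = a.name_soundex
   AND b.id <> a.id
SET a.tags = concat_ws(',', a.tags, 'Duplicate?');
The join then only touches rows whose codes actually collide, which is what makes it fast on 15k rows.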
Upvotes: 1
Reputation: 1269973
Databases (in general) are not designed for this purpose.
One algorithm that often gets used is Levenshtein distance. You can readily find implementations for MySQL, but that by itself doesn't solve your problem.
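For illustration, the pairwise version might look like the sketch below. MySQL has no built-in LEVENSHTEIN() function, so this assumes you have installed one of those implementations; the function name and the threshold of 2 edits are illustrative.
-- assumes a user-defined LEVENSHTEIN(a, b) function; not built into MySQL
SELECT a.id, a.author_name, b.id, b.author_name
FROM authors a
INNER JOIN authors b ON b.id > a.id
WHERE LEVENSHTEIN(a.author_name, b.author_name) <= 2;
This still compares every pair of rows, which is why it doesn't address the performance problem by itself.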
To be honest, such string matching often requires manual checking. You might consider just loading the data into a spreadsheet, ordering them alphabetically, and noting in the spreadsheet when values are the same.
In the end, you are going to have to spend a lot of time figuring out where the "duplicates" are, so you might as well plan your workload around that.
Upvotes: 0