Slam
Slam

Reputation: 3325

MySQL optimize query for "fuzzy matching" duplicates?

I'm cleaning up a dirty database I inherited and need to "fuzzy match" names for human review. I've come up with a solution that works but it terribly slow — 7 minutes on 15k rows. I have the feeling I'm overlooking some really simple solution.

Example records:

1  John Smith
2  John Q Smith
3  Janway Smith 
4  Jane Chen
5  David Jones
6  Natalia La Brody
7  Natalia LaBrody
8  LaBrody
9  Dave Jones

I need multiple criteria for this fuzzy match. Two I've come up with include:

  1. Check for matches based on a concat of the first three and last five letters.
  2. If a single word check against all last words
  3. (I may add more conditions)

My code looks like this:

UPDATE authors a
INNER JOIN (SELECT id, author_name FROM authors) b
    ON CASE WHEN a.author_name NOT REGEXP ' '
        THEN 
            a.author_name = 
            substring_index(b.author_name, ' ', -1) 
        ELSE 
            concat(LEFT(a.author_name, 3), RIGHT(a.author_name, 5)) = 
            concat(LEFT(b.author_name, 3), RIGHT(b.author_name, 5))
        END 
SET tags = concat_ws(',',tags,'Duplicate?')
WHERE a.id <> b.id

I was surprised I could put a CASE in an ON clause but it worked. Still, how can I do this with substantially better performance?

Upvotes: 0

Views: 2015

Answers (2)

sumit
sumit

Reputation: 15464

One approach is to use soundex . You cannot 100% rely on it but it help you to narrow down your search result and make the query fast

select t, soundex(t) from 
 (
 select 'John Smith' as t
 union 
 select 'John Q Smith' as t
 union 
 select 'Janway Smith'  as t
 union 
 select 'Jane Chen'  as t
 union
 select 'David Jones'  as t
 union
 select 'Natalia La Brody'  as t
 union
 select 'Natalia LaBrody'  as t
 union
 select 'LaBrody'  as t
 union 
 select 'dave jones' as t
 )tbl
group by soundex(t)

output

'Natalia La Brody', 'N34163'
'LaBrody', 'L163'
'John Smith', 'J5253'
'Jane Chen', 'J525'
'David Jones', 'D13252'
'dave jones', 'D1252'

Upvotes: 1

Gordon Linoff
Gordon Linoff

Reputation: 1269973

Databases (in general) are not designed for this purpose.

One algorithm that gets used is Levenshtein distance. You can readily find implementations for MySQL, but that doesn't help your problem.

To be honest, such string matching often requires manual checking. You might consider just loading the data into a spreadsheet, ordering them alphabetically, and noting in the spreadsheet when values are the same.

In the end, you are going to have to spend a lot of time figuring out where the "duplicates" are, so you might as well plan your workload around that.

Upvotes: 0

Related Questions