Reputation: 1946
I have a Microsoft SQL Server database table with around 7 million crowd-sourced records, primarily containing a string name value with some related details. For nearly every record it seems there are a dozen similar typo records and I am trying to do some fuzzy matching to identify record groups such as "Apple", "Aple", "Apples", "Spple", etc. These names can also contain multiple words with spaces between them.
I've come up with a solution of using an edit-distance scalar function that returns number of keystrokes required for transformation from string1 to string2 and using that function to join the table to itself. As you can imagine, this doesn't perform that well since its having to execute the function millions of times to evaluate a join.
So I put that in a cursor so at least only one string1 is being evaluated at a time, this at least gets results coming out but after letting it run for weeks it has only made it through evaluating 150,000 records. With 7 million to evaluate, I don't think I have the kind of time my method is going to take.
I put full text indexes on the string names, but couldn't really find a way to use the full text predicates when I didn't have a static value I was searching.
Any ideas how I could do something like the following in a way that wouldn't take months to run?
SELECT t1.name, t2.name
FROM names AS t1
INNER JOIN names AS t2
ON EditDistance(t1.name,t2.name) = 1
AND t1.id != t2.id
Upvotes: 3
Views: 4779
Reputation: 3249
You need to find a way to avoid comparing each record to every other record. If you are just using a single field, you can use a special data structure like a trie, for example https://github.com/mattandahalfew/Levenshtein_search
Upvotes: 1
Reputation: 670
You may use the DIFFERENCE ( character_expression , character_expression )
function to evaluate the difference in the SOUNDEX
code for each character expression. The SOUNDEX
code is used to evaluate the difference between strings.
DIFFERENCE
will return an integer of 0 (the highest possible difference) and 4 (the least possible difference). You could utilize this value to determine how closely matched the strings are (e.g. a condition similar to DIFFERENCE(column1, column2) > 3
would match records where the SOUNDEX
values of column1
and column2
are off by 1).
Here is a link to the documentation of the DIFFERENCE
function: https://technet.microsoft.com/en-us/library/ms188753(v=sql.105).aspx
Upvotes: 1