How to find rows in a table with similar string values

I have a Microsoft SQL Server database table with around 7 million crowd-sourced records, primarily containing a string name value with some related details. For nearly every record it seems there are a dozen similar typo records and I am trying to do some fuzzy matching to identify record groups such as "Apple", "Aple", "Apples", "Spple", etc. These names can also contain multiple words with spaces between them.

I've come up with a solution of using an edit-distance scalar function that returns number of keystrokes required for transformation from string1 to string2 and using that function to join the table to itself. As you can imagine, this doesn't perform that well since its having to execute the function millions of times to evaluate a join.

So I put that in a cursor so at least only one string1 is being evaluated at a time, this at least gets results coming out but after letting it run for weeks it has only made it through evaluating 150,000 records. With 7 million to evaluate, I don't think I have the kind of time my method is going to take.

I put full text indexes on the string names, but couldn't really find a way to use the full text predicates when I didn't have a static value I was searching.

Any ideas how I could do something like the following in a way that wouldn't take months to run?

 SELECT t1.name, t2.name
 FROM names AS t1
 INNER JOIN names AS t2
      ON EditDistance(t1.name,t2.name) = 1
      AND t1.id != t2.id

Upvotes: 3

Answers (2)

fgregg

Reputation: 3249

You need to find a way to avoid comparing each record to every other record. If you are just using a single field, you can use a special data structure like a trie, for example https://github.com/mattandahalfew/Levenshtein_search

Upvotes: 1

Joey

Reputation: 670

You may use the DIFFERENCE ( character_expression , character_expression ) function to evaluate the difference in the SOUNDEX code for each character expression. The SOUNDEX code is used to evaluate the difference between strings.

DIFFERENCE will return an integer of 0 (the highest possible difference) and 4 (the least possible difference). You could utilize this value to determine how closely matched the strings are (e.g. a condition similar to DIFFERENCE(column1, column2) > 3 would match records where the SOUNDEX values of column1 and column2 are off by 1).

Here is a link to the documentation of the DIFFERENCE function: https://technet.microsoft.com/en-us/library/ms188753(v=sql.105).aspx

Upvotes: 1

How to find rows in a table with similar string values

Answers (2)

Related Questions