Megetron
Megetron

Reputation:

Damerau-Levenshtein distance for words

I am looking for such an algorithm, but one that makes substitutions between words and not between letters. Is there such an algorithm?

I am looking for an implementation with SQL Server, but the name of the algorithm will be good enough.

Upvotes: 0

Views: 2018

Answers (2)

Gregory Klopper
Gregory Klopper

Reputation: 2295

  1. split both strings by by space
  2. create a word hash table by going through words in both strings (concatenate arrays which you get from split operation into one, if you want), each new word gets an incrementally new number
  3. use Levenshtein algorithm on the 2 resulting arays of numbers (but without converting them back to strings, since 11,12 -> "1112" -> 1,1,1,2

If you need to find misspellings while re-arranging words, I find that taking the word with least spaces and permuting its split array back into strings of rearranged same words in different order, then running Levenshtein on the new words and the second phrase works great!

Harward Medical School (2 spaces, 6 permutations) Harward School of Medicine (3 spaces, 24 permutations) Levenshtein distance - 17 Compare to, say, "Stanford School of Medicine" (Levenshtein distance 5)

Harward School Medical vs. Harward School of Medicine has LD of 6. Still allowing a mistake to Stanford, but MUCH MUCH CLOSER up the rank, so you can build in things like dropping words (dropping "of" in this case gets LD of only 3)

Performance of permuting words by space is O(n!), dropping words is O(n+m) where n,m are numbers of words in each phrase, unless you want to drop more than one word, or vice versa, you can only opt to drop words with fewer than 4 letters, or words from preposition/adjective lists, etc.

Performance of Levenshtein is explained on its wikipedia page.

Upvotes: 0

Nick Johnson
Nick Johnson

Reputation: 101149

As far as I'm aware, there's no reason you can't just use the existing Levenshtein algorithm - just use words as symbols instead of letters.

Upvotes: 1

Related Questions