Reputation:
I am looking for such an algorithm, but one that makes substitutions between words and not between letters. Is there such an algorithm?
I am looking for an implementation with SQL Server, but the name of the algorithm will be good enough.
Upvotes: 0
Views: 2018
Reputation: 2295
If you need to find misspellings while re-arranging words, I find that taking the word with least spaces and permuting its split array back into strings of rearranged same words in different order, then running Levenshtein on the new words and the second phrase works great!
Harward Medical School (2 spaces, 6 permutations) Harward School of Medicine (3 spaces, 24 permutations) Levenshtein distance - 17 Compare to, say, "Stanford School of Medicine" (Levenshtein distance 5)
Harward School Medical vs. Harward School of Medicine has LD of 6. Still allowing a mistake to Stanford, but MUCH MUCH CLOSER up the rank, so you can build in things like dropping words (dropping "of" in this case gets LD of only 3)
Performance of permuting words by space is O(n!), dropping words is O(n+m) where n,m are numbers of words in each phrase, unless you want to drop more than one word, or vice versa, you can only opt to drop words with fewer than 4 letters, or words from preposition/adjective lists, etc.
Performance of Levenshtein is explained on its wikipedia page.
Upvotes: 0
Reputation: 101149
As far as I'm aware, there's no reason you can't just use the existing Levenshtein algorithm - just use words as symbols instead of letters.
Upvotes: 1