Reputation: 9945
We have spellchecker implementation based on Levenshtein distance. As we couldn't calculate distance for all possible substitutions (Levenshtein distance between two string calculated in O(n^2)
) we use K-gram index for retrieving candidates for substitution.
So K-gram index is just one of the ways for fast eliminating irrelevant substitution. I'm interested in other ways as well. At the moment we used several more tricks. Considering that we are only interested in substitutions of edit distance no more the d from the original string we could use following rules:
k
k-grams. So the strings with count difference of k-grams k * d
could not have edit distance less than d: .Are those assumptions correct? And what other ways of substitutions eliminating exists applicable to spellcheckers?
Upvotes: 1
Views: 360
Reputation: 3195
in my experience the k-gram approximation leaves a lot to be desired (it excludes many relevant results).
instead put your terms in an automaton/transducer, trie, or even a sorted array can suffice, and find the true levenshtein matches via intersection.
its intuitive if you think about it: if you only want words within a distance of 1, and the input term is "foo", there is no sense in examining "bar", "baz" etc when examining the 'b' nodes. only boo, bfoo, etc stand a chance so you can restrict the search to only prefixes that can possibly lead to a final state.
so you just create an automaton that accepts all words within k edit distances of "foo" and then intersect this automaton with your dictionary automaton/trie/whatever.
you can compute these these DFAs extremely efficiently, avoiding any slow NFA-DFA determinization, etc:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652
Upvotes: 0
Reputation: 6753
you can use the simple rule to restrict the search to dictionary terms beginning with the same letter as the query string. The hope is that users do not misspell the first letter.
Also, you can use a permuterm index. consider all rotations of the query and traverse the B tree to find any dictionary terms that match any rotation. You can also refine this rotation scheme by omitting a suffix of l characters before performing the traversal
Upvotes: 1