Reputation: 52957
For example, starting with the set of english words, is there a structure/algorithm that allows one fast retrieval of strings such as "light" and "tight", using the word "right" as the query? I.e., I want to retrieve strings with small Levenshtein distance to the query string.
Upvotes: 9
Views: 2232
Reputation: 372664
The BK-tree data structure might be appropriate here. It's designed to efficiently support queries of the form "what are all words within edit distance k or less from a query word?" Its performance guarantees are reasonably good, and it's not too difficult to implement.
Hope this helps!
Upvotes: 5
Reputation: 16253
Since calculating Levenshtein distance is O(nm)
for strings of length n and m, the naive approach of calculating all Levenshtein distances L(querystring, otherstring)
is very expensive.
However, if you visualize the Levenshtein algorithm, it basically fills an n*m table with edit distances. But for words that start with the same few letters (prefix), the first few rows of the Levenshtein tables will be the same. (Fixing the query string, of course.)
This suggests using a trie (also called prefix tree): Read the query string, then build a trie of Levenshtein rows. Afterwards, you can easily traverse it to find strings close to the query string.
(This does mean that you have to build an new trie for a new query string. I don't think there is a similarly intriguing structure for all-pairs distances.)
I thought I recently saw an article about this with a nice python implementation. Will add a link if I can find it. Edit: Here it is, on Steve Hanov's blog.
Upvotes: 1
Reputation: 2833
I'm thinking the fastest way would be to pre-build a cache of similarities which you can index and access in O(1) time. The trick would be to find common misspellings to add to your cache, which could get pretty large.
I imagine Google would do something similar using their wide range of statistical query search data.
Upvotes: 0