MaiaVictor
MaiaVictor

Reputation: 52957

Data structure for retrieving strings that are close by Levenshtein distance

For example, starting with the set of english words, is there a structure/algorithm that allows one fast retrieval of strings such as "light" and "tight", using the word "right" as the query? I.e., I want to retrieve strings with small Levenshtein distance to the query string.

Upvotes: 9

Views: 2232

Answers (3)

templatetypedef
templatetypedef

Reputation: 372664

The BK-tree data structure might be appropriate here. It's designed to efficiently support queries of the form "what are all words within edit distance k or less from a query word?" Its performance guarantees are reasonably good, and it's not too difficult to implement.

Hope this helps!

Upvotes: 5

us2012
us2012

Reputation: 16253

Since calculating Levenshtein distance is O(nm) for strings of length n and m, the naive approach of calculating all Levenshtein distances L(querystring, otherstring) is very expensive.

However, if you visualize the Levenshtein algorithm, it basically fills an n*m table with edit distances. But for words that start with the same few letters (prefix), the first few rows of the Levenshtein tables will be the same. (Fixing the query string, of course.)

This suggests using a trie (also called prefix tree): Read the query string, then build a trie of Levenshtein rows. Afterwards, you can easily traverse it to find strings close to the query string.

(This does mean that you have to build an new trie for a new query string. I don't think there is a similarly intriguing structure for all-pairs distances.)

I thought I recently saw an article about this with a nice python implementation. Will add a link if I can find it. Edit: Here it is, on Steve Hanov's blog.

Upvotes: 1

kufudo
kufudo

Reputation: 2833

I'm thinking the fastest way would be to pre-build a cache of similarities which you can index and access in O(1) time. The trick would be to find common misspellings to add to your cache, which could get pretty large.

I imagine Google would do something similar using their wide range of statistical query search data.

Upvotes: 0

Related Questions