Sam Phippen
Sam Phippen

Reputation: 146

Finding nearest string to a pair in a lexicon

I am currently trying to come up with an efficient solution to the problem with the following formulation:

Given an input string s and a fixed lexicon find a string w1||w2 (|| denotes concatenation, w1 and w2 are words in the lexicon) with the lowest levenshtein distance to s.

The obvious naive solution is:

for word1 in lexicon:
   for word2 in lexicon:
       if lev_dist(word1 + word2) < lev_dist(lowest):
          lowest = word1 + word2

I'm sure there must be better solutions to the problem. Can anyone offer any insight?

Upvotes: 1

Views: 193

Answers (3)

Ivan Vergiliev
Ivan Vergiliev

Reputation: 3841

If you are running lots of queries on the same lexicon and want to improve the query time, but can afford some time for preprocessing, you can create a trie containing all possible words in the form w1 || w2. Then you can use the algorithm described here: Fast and Easy Levenshtein distance using a Trie to find the answer for any word you need.

What the algorithm does is basically walking the nodes of the trie and keeping track of the current minimum. Then if you end up in some node and the Levenshtein distance (of the word from the root to the current node and the input string s) is already larger than the minimum achieved so far, you can prune the entire subtree rooted at this node, because it cannot yield an answer.

In my testing with a dictionary of English words and random query words, this is anywhere between 30 and 300 times faster than the normal approach of testing every word in the dictionary, depending on the type of queries you run on it.

Upvotes: 0

Beth
Beth

Reputation: 4670

I assume you want to do this many times for the same lexicon. You've a misspelled word and suspect it's caused by the lack of a space between them, for example.

The first thing you'll surely need is a way to estimate string "closeness". I'm fond of normalization techniques. For example, replace each letter by a representative from an equivalence class. (Perhaps M and N both go to M because they sound similar. Perhaps PH --> F for a similar reason.)

Now, you'll want your normalized lexicon entered both frontwards and backwards into a trie or some similar structure.

Now, search for your needle both forwards and backwards, but keep track of intermediate results for both directions. In other words, at each position in the search string, keep track of the list of candidate trie nodes which have been selected at that position.

Now, compare the forwards- and backwards-looking arrays of intermediate results, looking for places that look like a good join point between words. You might also check for join points off-by-one from each other. (In other words, you've found the end of the first word and the beginning of the second.)

If you do, then you've found your word pair.

Upvotes: 0

mcdowella
mcdowella

Reputation: 19611

You may be able to do a bit better by putting lower bounds on the cost of individual strings.

Looking at the algorithm in http://en.wikipedia.org/wiki/Levenshtein_distance, at the time you care computing d[i, j] for the distance you know you are adding in a contribution that depends on s[i] and t[j], where s and t are the strings being compared, so you can make the costs of change/delete/insert depend on the position of the operation within the two strings.

This means that you can compute the distance between abcXXX and abcdef using a cost function in which operations on the characters marked XXX are free. This allows you to compute the cost of transforming abcXXX to abcdef if the string XXX is in fact the most favourable string possible.

So for each word w1 in the lexicon compute the distance between w1XXX and the target string and XXXw1 and the target string. Produce two copies of the lexicon, sorted in order of w1XXX distance and XXXw1 distance. Now try all pairs in order of the sum of left hand and right hand costs, which is a lower bound on the cost of that pair. Keep track of the best answer so far. When the best answer is at least as good as the next lower bound cost you encounter, you know that nothing you can try can improve on this best answer, so you can stop.

Upvotes: 1

Related Questions