Reputation: 2484
Suppose I have a large list of words (about 4-5 thousand, and increasing). Someone searches for a string, but the string is not found in the word list. What would be the best and most efficient way to find words similar to the input string? The first thing that came to my mind was calculating the Levenshtein distance between each entry of the word list and the input string, but is that the optimal way to do it?
(Note that this is not a language-specific question.)
Upvotes: 1
Views: 154
Reputation: 4348
EDIT: new solution
Yes, calculating the Levenshtein distance between your input and every word in the list is a reasonable approach, but it takes a lot of time. BK-trees can improve on this, but they quickly become slow as the allowed Levenshtein distance grows. It seems we can speed up the Levenshtein distance calculations using a trie, as described in this excellent blog post:
Fast and Easy Levenshtein distance using a Trie
It relies on the fact that the dynamic programming lookup table for the Levenshtein distance has common rows across different invocations, e.g. levenshtein(kate, cat) and levenshtein(kate, cats).
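A minimal sketch of that idea (my own condensation; the blog post has the full program, and these names are illustrative):

    class TrieNode:
        def __init__(self):
            self.word = None      # set on nodes that end a dictionary word
            self.children = {}

    def insert(root, word):
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.word = word

    def search(root, word, max_cost):
        # First DP row: distance from the empty prefix to each prefix of `word`.
        row = list(range(len(word) + 1))
        results = []
        for ch, child in root.children.items():
            _search(child, ch, word, row, results, max_cost)
        return results

    def _search(node, ch, word, prev_row, results, max_cost):
        # Each trie node extends its parent's DP row by one character,
        # so all words sharing a prefix share those rows.
        row = [prev_row[0] + 1]
        for col in range(1, len(word) + 1):
            row.append(min(row[col - 1] + 1,                            # insertion
                           prev_row[col] + 1,                           # deletion
                           prev_row[col - 1] + (word[col - 1] != ch)))  # match/mismatch
        if node.word is not None and row[-1] <= max_cost:
            results.append((node.word, row[-1]))
        if min(row) <= max_cost:  # prune: no descendant can do better
            for next_ch, child in node.children.items():
                _search(child, next_ch, word, row, results, max_cost)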
Running the Python program given on that page with the TWL06 dictionary gives:
> python dict_lev.py HACKING 1
Read 178691 words into 395185 nodes
('BACKING', 1)
('HACKING', 0)
('HACKLING', 1)
('HANKING', 1)
('HARKING', 1)
('HAWKING', 1)
('HOCKING', 1)
('JACKING', 1)
('LACKING', 1)
('PACKING', 1)
('SACKING', 1)
('SHACKING', 1)
('RACKING', 1)
('TACKING', 1)
('THACKING', 1)
('WHACKING', 1)
('YACKING', 1)
Search took 0.0189998 s
That's really fast, and it would be even faster in other languages. Most of the time is spent building the trie, which is irrelevant since it only needs to be done once and then kept in memory.
The only minor downside to this is that tries take up a lot of memory (which can be reduced with a DAWG, at the cost of some speed).
Another approach: Peter Norvig has a great article (with complete source code) on spelling correction.
http://norvig.com/spell-correct.html
The idea is to generate the possible edits of the word and then choose the most likely spelling correction among them.
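The core edit-generation step looks roughly like this (lightly adapted from the article):

    import string

    def edits1(word):
        # All strings one edit (delete, transpose, replace, insert) away from `word`.
        letters = string.ascii_lowercase
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = [L + R[1:] for L, R in splits if R]
        transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
        replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
        inserts = [L + c + R for L, R in splits for c in letters]
        return set(deletes + transposes + replaces + inserts)

The candidates are then filtered against the known word list and ranked by word frequency.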
Upvotes: 1
Reputation: 20348
If you are interested in the code itself, I implemented an algorithm for finding the optimal alignment between two strings. It basically shows how to transform one string into the other with k operations (where k is the Levenshtein/edit distance of the strings). It could be simplified a bit for your needs (as you only need the distance itself). By the way, it works in O(mn) time, where m and n are the lengths of the strings. My implementation is based on: this and this.
    # optimization: using ints instead of strings:
    #  1 ~ "left",    "insertion"
    #  7 ~ "up",      "deletion"
    # 17 ~ "up-left", "match/mismatch"
    def GetLevenshteinDistanceWithBacktracking(sequence1, sequence2):
        distances = [[0 for y in range(len(sequence2) + 1)] for x in range(len(sequence1) + 1)]
        backtracking = [[1 for y in range(len(sequence2) + 1)] for x in range(len(sequence1) + 1)]

        for i in range(1, len(sequence1) + 1):
            distances[i][0] = i
        for i in range(1, len(sequence2) + 1):
            distances[0][i] = i

        for j in range(1, len(sequence2) + 1):
            for i in range(1, len(sequence1) + 1):
                if sequence1[i - 1] == sequence2[j - 1]:
                    distances[i][j] = distances[i - 1][j - 1]
                    backtracking[i][j] = 17
                else:
                    deletion = distances[i - 1][j] + 1
                    substitution = distances[i - 1][j - 1] + 1
                    insertion = distances[i][j - 1] + 1
                    distances[i][j] = min(deletion, substitution, insertion)

                    if distances[i][j] == deletion:
                        backtracking[i][j] = 7
                    elif distances[i][j] == insertion:
                        backtracking[i][j] = 1
                    else:
                        backtracking[i][j] = 17
        return (distances[len(sequence1)][len(sequence2)], backtracking)

    def Alignment(sequence1, sequence2):
        cost, backtracking = GetLevenshteinDistanceWithBacktracking(sequence1, sequence2)

        alignment1 = alignment2 = ""
        i = len(sequence1)
        j = len(sequence2)

        # from the backtracking matrix we recover the optimal alignment
        while i > 0 or j > 0:
            if i > 0 and j > 0 and backtracking[i][j] == 17:  # match/mismatch
                alignment1 = sequence1[i - 1] + alignment1
                alignment2 = sequence2[j - 1] + alignment2
                i -= 1
                j -= 1
            elif i > 0 and backtracking[i][j] == 7:  # deletion
                alignment1 = sequence1[i - 1] + alignment1
                alignment2 = "-" + alignment2
                i -= 1
            elif j > 0 and backtracking[i][j] == 1:  # insertion
                alignment2 = sequence2[j - 1] + alignment2
                alignment1 = "-" + alignment1
                j -= 1
            elif i > 0:  # first column: only deletions remain
                alignment1 = sequence1[i - 1] + alignment1
                alignment2 = "-" + alignment2
                i -= 1
        return (cost, (alignment1, alignment2))
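For example:

    cost, (alignment1, alignment2) = Alignment("kitten", "sitting")
    print(cost)  # 3: substitute k->s, substitute e->i, insert g
    print(alignment1)
    print(alignment2)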
It depends on the broader context and on how accurate you want to be. But what I would (probably) start with:

- Compute the distances lazily, and cut a candidate off as soon as the max. distance has been exceeded (sketched below).
- If that is not accurate enough, move on to a machine learning
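A sketch of the early-exit idea (a plain row-by-row DP; the function name and cutoff handling are illustrative):

    def bounded_levenshtein(a, b, max_dist):
        # Returns the Levenshtein distance, or None once it provably exceeds max_dist.
        if abs(len(a) - len(b)) > max_dist:
            return None  # the length difference alone already exceeds the bound
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # deletion
                               cur[j - 1] + 1,              # insertion
                               prev[j - 1] + (ca != cb)))   # match/mismatch
            if min(cur) > max_dist:
                return None  # every cell exceeds the bound; it can only get worse
            prev = cur
        return prev[-1] if prev[-1] <= max_dist else None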
Upvotes: 1
Reputation: 1543
I think something better than this exists, but BK-trees are at least a good optimization over brute force.
They use the property that Levenshtein distance forms a metric space: if the distance between your query and an arbitrary string s from the dict is d, then (by the triangle inequality) all your results must be at a distance between d-n and d+n from s. Here n is the maximum Levenshtein distance from the query that you want to allow in the output.
It's explained in detail here: http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees
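A minimal sketch (the node layout and helper are my own; the post covers the details):

    def levenshtein(a, b):
        # standard row-by-row DP edit distance
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]

    class BKTree:
        def __init__(self, words):
            it = iter(words)
            self.root = (next(it), {})  # node = (word, {distance: child node})
            for w in it:
                self._add(w)

        def _add(self, word):
            node = self.root
            while True:
                d = levenshtein(word, node[0])
                if d == 0:
                    return  # duplicate word
                if d not in node[1]:
                    node[1][d] = (word, {})
                    return
                node = node[1][d]

        def query(self, word, n):
            # All dictionary words within distance n of `word`.
            results, stack = [], [self.root]
            while stack:
                w, children = stack.pop()
                d = levenshtein(word, w)
                if d <= n:
                    results.append((w, d))
                # triangle inequality: only subtrees keyed in [d-n, d+n] can match
                for k, child in children.items():
                    if d - n <= k <= d + n:
                        stack.append(child)
            return results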
Upvotes: 1