Reputation: 195
I have a list of word library and a text in which there are a spell error (typos), and I want to correct the word spell error to be correct according to list of library
for example
in list of word :
listOfWord = [...,"halo","saya","sedangkan","semangat","cemooh"..];
this is my string :
string = "haaallllllooo ssya sdngkan ceemoooh , smngat semoga menyenangkan"
I want change the spellerror to be correct like :
string = "halo saya sedangkan cemooh, semangat semoga menyenangkan"
what is the best algorithm to check each word in list, because I have millions of words in the list and have many possibilities
Upvotes: 8
Views: 10260
Reputation: 30605
You can use difflib
's get close matches, though it is not that efficient.
words = ["halo","saya","sedangkan","semangat","cemooh"]
def get_exact_words(input_str):
exact_words = difflib.get_close_matches(input_str,words,n=1,cutoff=0.7)
if len(exact_words)>0:
return exact_words[0]
else:
return input_str
string = "haaallllllooo ssya sdngkan ceemoooh , smngat semoga menyenangkan"
string = string.split(' ')
exact = [get_exact_words(word) for word in string]
exact = ' '.join(exact)
print(exact)
Output :
With difflib
haaallllllooo saya sedangkan cemooh , semangat semangat menyenangkan
Upvotes: 5
Reputation: 179
I think you should apply string distance algorithms with a word to find nearest. You can apply these algorithms to find the nearest word. Those are mostly O(n) algorithms so at the end your sentence replacement would cost you O(n) at most.
Upvotes: -1
Reputation: 40838
You can use pyenchant to check spelling with your list of words.
>>> import enchant
>>> d = enchant.request_pwl_dict("mywords.txt")
>>> d.check('helo')
False
>>> d.suggest("Helo")
['He lo', 'He-lo', 'Hello', 'Helot', 'Help', 'Halo', 'Hell', 'Held', 'Helm', 'Hero', "He'll"]
You need to split your words and check one-by-one, choose the first suggestion to replace if it's false. There are more advance features in the tutorial here. http://pyenchant.readthedocs.io/en/latest/tutorial.html
Upvotes: 0
Reputation: 2082
You can use hashing techniques for checking correct pattern, something on the lines of Rabin Karp Algorithm.
You know what would be the hash value of your original strings in the list. For spell correction, you can try the combination of those words that gives you same hash value before matching them with original string present in the dictionary. This would require, anyways, to iterate through all the characters in the spellerror list only once. But it will be efficient.
Upvotes: 1
Reputation: 37172
I am assuming you are writing spell checker for some language.
You might want tokenize the sentence into words.
Then shorten words like haaallllllooo
to haalloo
. Assuming the language you have doesn't have words that have many repeated letters too often. Easy to check since you have the dictionary.
Then you can use this algorithm/implementation by Peter Norvig. All you have to do is to replace his dictionary of correct words with your dictionary.
Upvotes: 1
Reputation: 744
It depends on how your data is stored, but you'll probably want to use a pattern matching algorithm like Aho–Corasick. Of course, that assumes your input data structure is a Trie. A Trie a very space-efficient storage container for words that may also be of interest to you (again, depending on your environment.)
Upvotes: 7