Reputation: 6340
The speech recognition software that I'm using gives less than optimal results.
E.g. session is returned as fashion or mission.
Right now I have a dictionary like:
matches = {
'session': ['fashion', 'mission'],
...
}
and I am looping over all the words to find a match.
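For concreteness, the lookup described above amounts to something like this (the helper function and its name are mine, not part of the original code):
def resolve(heard, matches):
    # Scan every keyword and its known misrecognitions for an exact hit.
    for keyword, variants in matches.items():
        if heard == keyword or heard in variants:
            return keyword
    return None  # no keyword claimed this word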
I do not mind false positives, since the application accepts only a limited set of keywords. However, it is tedious to manually enter new words for each of them, and the speech recognizer comes up with new words every time I speak.
I am also running into difficulties where a long word is returned as a group of smaller words, so the above approach won't work.
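One way to cope with a keyword coming back as several smaller words is to join the heard words before comparing, e.g. with the standard library's difflib; a sketch, where the function name and the 0.8 cutoff are illustrative assumptions:
from difflib import SequenceMatcher

def best_keyword(heard_phrase, keywords, min_ratio=0.8):
    # Collapse spaces so a split result like "mis sion" can still match "mission".
    joined = heard_phrase.replace(" ", "")
    # Score every keyword against the joined string (ratio is between 0 and 1).
    scored = [(SequenceMatcher(None, joined, kw).ratio(), kw) for kw in keywords]
    score, keyword = max(scored)
    return keyword if score >= min_ratio else None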
So, is there an in-built method in nltk to do this? Or even a better algorithm that I could write myself?
Upvotes: 3
Views: 3549
Reputation: 3165
You can use fuzzywuzzy, a Python package for fuzzy matching of words and strings.
To install the package:
pip install fuzzywuzzy
Sample code related to your question:
from fuzzywuzzy import fuzz

MIN_MATCH_SCORE = 80  # accept candidates that are at least 80% similar

heard_word = "brain"
possible_words = ["watermelon", "brian"]

# Keep every candidate whose similarity score clears the threshold.
guessed_word = [word for word in possible_words if fuzz.ratio(heard_word, word) >= MIN_MATCH_SCORE]
print('I heard {0} and guessed {1}'.format(heard_word, guessed_word))
Here are the documentation and repo for fuzzywuzzy.
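Since the question is about mapping heard words onto a fixed keyword set, note that fuzzywuzzy also ships a process helper that picks the best candidate in one call; the keyword list and cutoff below are illustrative assumptions:
from fuzzywuzzy import process

KEYWORDS = ["session", "record", "pause"]  # hypothetical keyword set

heard_word = "fashion"
# extractOne returns the best (choice, score) pair, or None if no choice
# reaches score_cutoff.
match = process.extractOne(heard_word, KEYWORDS, score_cutoff=60)
if match:
    keyword, score = match
    print('I heard {0} and guessed {1} (score {2})'.format(heard_word, keyword, score))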
Upvotes: 3
Reputation: 36
You may want to look into python-Levenshtein. It's a Python C extension module for calculating string distances and similarities.
Something like this silly, inefficient code might work:
from Levenshtein import jaro_winkler

heard_word = "brain"
possible_words = ["watermelon", "brian"]

# Score each candidate; jaro_winkler returns a similarity between 0 and 1.
word_scores = [jaro_winkler(heard_word, possible) for possible in possible_words]
# Pick the candidate with the highest similarity score.
guessed_word = possible_words[word_scores.index(max(word_scores))]
print('I heard {0} and guessed {1}'.format(heard_word, guessed_word))
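Note that jaro_winkler returns a similarity in the 0-1 range (1.0 means identical strings), so if you want to reject poor matches instead of always taking the best candidate, compare the top score against a cutoff such as 0.8 before accepting it.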
Here's the documentation and a non-maintained repo.
Upvotes: 2