Reputation: 6340
The speech recognition software that I'm using gives less than optimal results.
E.g. session is returned as fashion or mission.
Right now I have a dictionary like:
matches = {
'session': ['fashion', 'mission'],
...
}
and I am looping over all the words to find a match.
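For concreteness, the lookup described above amounts to something like this (the helper function and its name are mine, not part of the original code):
def resolve(heard, matches):
    # Scan every keyword and its known misrecognitions for an exact hit.
    for keyword, variants in matches.items():
        if heard == keyword or heard in variants:
            return keyword
    return None  # no keyword claimed this word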
I do not mind false positives, since the application accepts only a limited set of keywords. However, it is tedious to manually enter new words for each of them, and the speech recognizer comes up with new words every time I speak.
I am also running into difficulties where a long word is returned as a group of smaller words, so the above approach won't work.
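One way to cope with a keyword coming back as several smaller words is to join the heard words before comparing, e.g. with the standard library's difflib; a sketch, where the function name and the 0.8 cutoff are illustrative assumptions:
from difflib import SequenceMatcher

def best_keyword(heard_phrase, keywords, min_ratio=0.8):
    # Collapse spaces so a split result like "mis sion" can still match "mission".
    joined = heard_phrase.replace(" ", "")
    # Score every keyword against the joined string (ratio is between 0 and 1).
    scored = [(SequenceMatcher(None, joined, kw).ratio(), kw) for kw in keywords]
    score, keyword = max(scored)
    return keyword if score >= min_ratio else None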
So, is there an in-built method in nltk to do this? Or even a better algorithm that I could write myself?
Upvotes: 3
Views: 3549
Reputation: 3165
You can use fuzzywuzzy, a Python package for fuzzy matching of words and strings.
To install the package:
pip install fuzzywuzzy
Sample code related to your question:
from fuzzywuzzy import fuzz

MIN_MATCH_SCORE = 80  # accept candidates that are at least 80% similar

heard_word = "brain"
possible_words = ["watermelon", "brian"]

# Keep every candidate whose similarity score clears the threshold.
guessed_word = [word for word in possible_words if fuzz.ratio(heard_word, word) >= MIN_MATCH_SCORE]
print('I heard {0} and guessed {1}'.format(heard_word, guessed_word))
Here are the documentation and repo for fuzzywuzzy.
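Since the question is about mapping heard words onto a fixed keyword set, note that fuzzywuzzy also ships a process helper that picks the best candidate in one call; the keyword list and cutoff below are illustrative assumptions:
from fuzzywuzzy import process

KEYWORDS = ["session", "record", "pause"]  # hypothetical keyword set

heard_word = "fashion"
# extractOne returns the best (choice, score) pair, or None if no choice
# reaches score_cutoff.
match = process.extractOne(heard_word, KEYWORDS, score_cutoff=60)
if match:
    keyword, score = match
    print('I heard {0} and guessed {1} (score {2})'.format(heard_word, keyword, score))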
Upvotes: 3
Reputation: 36
You may want to look into python-Levenshtein. It's a Python C extension module for calculating string distances and similarities.
Something like this silly, inefficient code might work:
from Levenshtein import jaro_winkler

heard_word = "brain"
possible_words = ["watermelon", "brian"]

# Score each candidate; jaro_winkler returns a similarity between 0 and 1.
word_scores = [jaro_winkler(heard_word, possible) for possible in possible_words]
# Pick the candidate with the highest similarity score.
guessed_word = possible_words[word_scores.index(max(word_scores))]
print('I heard {0} and guessed {1}'.format(heard_word, guessed_word))
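Note that jaro_winkler returns a similarity in the 0-1 range (1.0 means identical strings), so if you want to reject poor matches instead of always taking the best candidate, compare the top score against a cutoff such as 0.8 before accepting it.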
Here's the documentation and a non-maintained repo.
Upvotes: 2