LucyL

Reputation: 91

Why does symspellpy change "pediatrition" to "media tuition" instead of "pediatrician"?

I'm trying to perform spell correction on free text entered by users. symspellpy changes "pediatrition" to "media tuition" instead of "pediatrician", and changes "news achor" to "news actor" instead of "news anchor". Is there any way to get symspellpy to correct "pediatrition" to "pediatrician" instead of "media tuition"? Below is my code, based on some examples I found online:

import pkg_resources
from symspellpy import SymSpell

max_edit_distance_dictionary = 2
prefix_length = 7
max_edit_distance_lookup = 2

sym_spell = SymSpell(max_edit_distance_dictionary, prefix_length)

# Sample unigram and bigram frequency dictionaries bundled with symspellpy
dictionary_path = pkg_resources.resource_filename("symspellpy", "frequency_dictionary_en_82_765.txt")
bigram_path = pkg_resources.resource_filename("symspellpy", "frequency_bigramdictionary_en_243_342.txt")

# Both loaders return False when the file cannot be found
if not sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1):
    print("Dictionary file not found")
if not sym_spell.load_bigram_dictionary(bigram_path, term_index=0, count_index=2):
    print("Bigram dictionary file not found")

input_term = 'pediatrition'
suggestions = sym_spell.lookup_compound(input_term,
                                        max_edit_distance=max_edit_distance_lookup,
                                        transfer_casing=True)
for suggestion in suggestions:
    print(suggestion)

Upvotes: 0

Views: 612

Answers (1)

Wolf Garbe

Reputation: 184

pediatrition

media tuition : edit distance=3

pediatrician : edit distance=2

The problem is that the word "pediatrician" is simply not in the sample dictionary being used, so SymSpell cannot suggest it. This can be fixed by using a more complete dictionary, by adding the word to the dictionary file with a text editor, or by adding it programmatically with CreateDictionaryEntry() (create_dictionary_entry() in symspellpy).
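The programmatic route could look roughly like the sketch below. The helper name add_terms, the MISSING_TERMS mapping, and the count value are hypothetical, chosen only for illustration; the count should be picked to sit sensibly next to the frequencies already in the loaded dictionary. The only symspellpy call used is create_dictionary_entry(key, count).

```python
# Hypothetical mapping of domain terms missing from the sample dictionary
# to illustrative frequency counts (the numbers are assumptions).
MISSING_TERMS = {
    "pediatrician": 1_000_000,
}


def add_terms(sym_spell, term_counts):
    """Register each term and its count on an existing SymSpell instance."""
    for term, count in term_counts.items():
        # create_dictionary_entry adds the term (or updates its count)
        # so that later lookups can offer it as a suggestion.
        sym_spell.create_dictionary_entry(term, count)
```

With the question's code, add_terms(sym_spell, MISSING_TERMS) would be called after load_dictionary() and before lookup_compound(). Whether "pediatrician" then wins still depends on the edit distances and counts of the competing suggestions.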

news acor

news actor : edit distance=1

news anchor : edit distance=2

The problem here is that the suggestion "news actor" has a smaller edit distance than "news anchor". SymSpell always chooses the suggestion with the lowest edit distance; only when several suggestions share the same edit distance does it use the Naive Bayes probability to determine the most likely one.
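The ordering rule can be illustrated with a small standard-library sketch. It is not SymSpell's implementation: it uses plain Levenshtein distance (the examples here involve no transpositions, so the numbers match those quoted above), and it stands in raw, made-up frequency counts for the Naive Bayes probability.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def rank(term, candidates):
    """Sort candidates by edit distance first, then by higher count."""
    scored = [(levenshtein(term, cand), -count, cand)
              for cand, count in candidates.items()]
    return [(cand, dist) for dist, _, cand in sorted(scored)]


# Hypothetical counts: "news anchor" is made far more frequent on purpose.
candidates = {"news actor": 10, "news anchor": 1000}
ranked = rank("news acor", candidates)
print(ranked)
```

Even with a much higher count, "news anchor" stays behind "news actor" because frequency is only consulted to break ties between suggestions at the same edit distance.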

Upvotes: 1
