chadb

Reputation: 1168

How can I use a spellchecker to add back a missing ñ?

How can I use a spellchecker to correctly identify that a word is missing its ñ?

I have tried autocorrect, but it does not detect that the ñ is missing:

from autocorrect import Speller
spell = Speller(lang='es')

print(spell('gatto'))
print(spell('ano'))
print(spell('manana'))

gato 
ano
manana

I have also tried spellchecker, but it does not flag manana or ano as misspelled:

from spellchecker import SpellChecker
spell = SpellChecker(language='es')
misspelled = ["gatto", "manana", "ano"]
misspelled = spell.unknown(misspelled)
for word in misspelled:
    print(word, spell.correction(word))

gatto gato

Upvotes: 1

Views: 33

Answers (1)

Amadan

Reputation: 198304

The data for autocorrect lists manana as a correct word, which is why it is not getting corrected. ano is a valid word with a somewhat different meaning from año, and a simple spellchecker can't know you don't mean that. The correction for gatto is gato, which doesn't, and shouldn't, include ñ. However:

spell("jalapeno")
# => jalapeño

Now, as to why manana is in the dictionary, I can't know for sure. That is a question either for native speakers or for the person who created the frequency data that the module uses. According to that data (the version downloaded at the time of this answer), mañana was found 40238 times and manana 1853 times: much less common, but present. Similarly, España appears 1356943 times and Espana 8297 times.

The autocorrect package works like this: if the word being tested is itself a candidate (i.e. it was found in the frequency list), it is returned unchanged. If not, the most frequent of the one-typo candidates is returned. If that fails too, and fast=False, the two-typo candidates are checked. Since manana is itself in the word list, it is returned as-is, even though it is much less common than mañana.
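The lookup order described above can be sketched as a toy model. This is not the package's actual code: `ALPHABET`, `one_typo_candidates` and `toy_correct` are hypothetical names, the two-typo step is left out, and the counts for jalapeño and gato are invented for illustration (only the mañana and manana counts come from the data quoted above).

```python
from typing import Dict, Set

# Letters used to generate candidates; an assumption for this sketch.
ALPHABET = "abcdefghijklmnñopqrstuvwxyzáéíóúü"


def one_typo_candidates(word: str) -> Set[str]:
    """Every string one edit away: delete, transpose, replace, insert."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    transposes = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    replaces = {a + c + b[1:] for a, b in splits if b for c in ALPHABET}
    inserts = {a + c + b for a, b in splits for c in ALPHABET}
    return deletes | transposes | replaces | inserts


def toy_correct(word: str, freqs: Dict[str, int]) -> str:
    """Toy version of the lookup order (two-typo step omitted)."""
    # 1. A word found in the frequency list is returned unchanged.
    if word in freqs:
        return word
    # 2. Otherwise pick the most frequent known one-typo candidate.
    known = [c for c in one_typo_candidates(word) if c in freqs]
    if known:
        return max(known, key=freqs.get)
    # 3. Nothing found: give the word back as-is.
    return word


# Counts for "jalapeño" and "gato" are made up for the example.
freqs = {"mañana": 40238, "manana": 1853, "jalapeño": 10, "gato": 500}
print(toy_correct("manana", freqs))    # manana (already in the list)
print(toy_correct("jalapeno", freqs))  # jalapeño
print(toy_correct("gatto", freqs))     # gato
```

In this model, removing the "manana" entry from `freqs` is enough for `toy_correct("manana", ...)` to return mañana, which is the idea behind the solutions below.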

The README.md of autocorrect does not specify which dataset was used to count the word frequencies, but for new languages it suggests getting "a bunch of text" and notes that the "Easiest way is to download wikipedia." If Wikipedia was indeed used, then a single occurrence of manana in Spanish Wikipedia is enough for the word to be considered correct, so it will not be autocorrected.
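To illustrate how one stray occurrence ends up in a frequency list, here is a minimal word-counting sketch. The tokenization and the `count_words` helper are assumptions for illustration, not the package's actual counting code, and the corpus sentence is invented.

```python
import re
from collections import Counter


def count_words(text: str) -> Counter:
    """Count lower-cased runs of letters (hypothetical tokenizer)."""
    # [^\W\d_]+ matches Unicode letters only, so ñ is kept.
    return Counter(re.findall(r"[^\W\d_]+", text.lower()))


# Invented corpus: two correct spellings and one unaccented one.
corpus = "Hasta mañana. Nos vemos mañana en España. The song Manana was a hit."
counts = count_words(corpus)

print(counts["mañana"], counts["manana"])  # 2 1
print("manana" in counts)                  # True
```

Once manana has a count of at least one, a checker that treats every listed word as correct has no reason to change it.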

As to solutions:

  • You might create a new frequency list (according to the package's instructions) from a text corpus that you know to not include incorrect words. I am sure the author would value the pull request.

  • You might use a different spellchecker, for example aspell. Its Python package relies on the aspell program and the appropriate language file being installed on your system. I have not tried the Spanish aspell, but I believe its dictionary is likely to be more correct.

  • You might use a wordlist you know to be correct to prune the autocorrect package's word frequencies. To find the latter, use this in Python:

    import os
    import autocorrect

    # Path to the bundled Spanish word-frequency archive
    print(os.path.join(os.path.dirname(autocorrect.__file__), 'data', 'es.tar.gz'))
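The pruning idea from the last bullet could look roughly like this. It is a sketch on a plain dict: `prune_frequencies` is a hypothetical helper, the trusted list is a stand-in for a real dictionary file, and reading the frequencies out of the archive and feeding them back to the package is left out because the archive's layout is not covered here.

```python
from typing import Dict, Iterable


def prune_frequencies(
    freqs: Dict[str, int], trusted_words: Iterable[str]
) -> Dict[str, int]:
    """Keep only the entries whose word is in the trusted list."""
    # Compare case-insensitively, since the counts may be cased (España).
    trusted = {w.lower() for w in trusted_words}
    return {w: n for w, n in freqs.items() if w.lower() in trusted}


# Counts quoted in the answer; the trusted list is illustrative.
freqs = {"mañana": 40238, "manana": 1853, "España": 1356943, "Espana": 8297}
trusted = ["mañana", "España"]

print(prune_frequencies(freqs, trusted))
# {'mañana': 40238, 'España': 1356943}
```

With manana and Espana gone from the frequencies, they are no longer treated as known words, so the ñ variants become the best one-typo candidates.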
    

Upvotes: 1
