Reputation: 1168
How can I use a spellchecker to correctly identify that a word is missing ñ?
I have tried to use autocorrect, but it does not detect that the ñ is missing:
from autocorrect import Speller
spell = Speller(lang='es')
print(spell('gatto'))
print(spell('ano'))
print(spell('manana'))
gato
ano
manana
I have also tried spellchecker, but that does not detect that manana and ano are misspelled; only gatto is reported:
from spellchecker import SpellChecker
spell = SpellChecker(language='es')
misspelled = ["gatto", "manana", "ano"]
misspelled = spell.unknown(misspelled)
for word in misspelled:
    print(word, spell.correction(word))
gatto gato
Upvotes: 1
Views: 33
Reputation: 198304
The data for autocorrect lists manana as a correct word, which is why it is not getting corrected. ano is a valid word with a rather different meaning from año, and a simple spellchecker can't know you don't mean that. gatto doesn't, and shouldn't, include ñ. However:
spell("jalapeno")
# => jalapeño
Now, as to why manana is in the dictionary, I can't know for sure; that is a question either for native speakers or for the person who created the frequency data that the module uses. According to that data (the version downloaded at the time of this answer), mañana was found 40238 times and manana 1853 times: much less common, but present. Similarly, España is at 1356943 and Espana at 8297.
The autocorrect package works like this: if the word being tested is itself a candidate (i.e. it was found in the frequency list), it is returned unchanged. If not, the most frequent of the one-typo candidates is returned. If that fails too, and fast=False, then the two-typo candidates are checked. Since manana is itself in the word list, it is returned as-is, even though it is much less common than mañana.
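The lookup order described above can be sketched in a few lines. This is a simplified illustration, not the package's actual code: the names `one_typo_candidates` and `correct`, the alphabet, and the toy frequency table are all hypothetical. Only the counts for mañana and manana come from the data mentioned above; the other counts are made up.

```python
# Simplified sketch of the lookup order described above (not autocorrect's real code).
ALPHABET = "abcdefghijklmnopqrstuvwxyzáéíóúüñ"  # assumed Spanish alphabet


def one_typo_candidates(word):
    """Return every string one edit (delete, swap, replace, insert) away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    swaps = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in ALPHABET]
    inserts = [a + c + b for a, b in splits for c in ALPHABET]
    return set(deletes + swaps + replaces + inserts)


def correct(word, freqs, fast=False):
    """Mimic the described behaviour: known words are never changed."""
    if word in freqs:
        return word  # the word itself is in the frequency list
    one_typo = one_typo_candidates(word)
    candidates = [w for w in one_typo if w in freqs]
    if not candidates and not fast:
        # Fall back to words two edits away.
        candidates = [w2 for w1 in one_typo
                      for w2 in one_typo_candidates(w1) if w2 in freqs]
    if not candidates:
        return word  # nothing better found
    return max(candidates, key=freqs.get)  # most frequent candidate wins


# Toy data: the first two counts are from the answer, the rest are invented.
toy_freqs = {"mañana": 40238, "manana": 1853, "jalapeño": 100, "gato": 5000}

print(correct("manana", toy_freqs))    # stays as-is: it is in the list
print(correct("jalapeno", toy_freqs))  # not in the list, so it gets corrected
```

With manana removed from `toy_freqs`, the same call would return mañana, which is the motivation for the pruning approach suggested further down.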
The README.md of autocorrect does not specify which dataset was used to count the word frequencies, but for new languages it suggests that, to get "a bunch of text", the "Easiest way is to download wikipedia." If Wikipedia was indeed used, then if Spanish Wikipedia includes the word manana even once, it will not be autocorrected, since it will be considered correct.
As to solutions:
You might create a new frequency list (according to the package's instructions) from a text corpus that you know does not include incorrect words. I am sure the author would value the pull request.
You might use a different spellchecker, for example aspell. The Python package for it relies on the aspell program, as well as the appropriate language file, being installed on your system. I have not tried using the Spanish aspell, but I believe its dictionary is likely to be more correct.
You might use a wordlist you know to be correct to prune the autocorrect package's word frequencies. To find the latter, use this in Python:
import os
import autocorrect
# Path to the bundled Spanish word-frequency archive
print(os.path.join(os.path.dirname(autocorrect.__file__), 'data', 'es.tar.gz'))
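As a rough sketch of the pruning idea, assuming the frequencies have already been loaded into a plain dict mapping word to count (the loading step is not shown), the filtering itself could look like the following. The helper `prune_frequencies` and the `trusted_words` list are hypothetical, and the lowercase keys are an assumption about how the data is stored; the counts are the ones quoted earlier.

```python
def prune_frequencies(freqs, trusted_words):
    """Keep only the entries whose word appears in a trusted wordlist."""
    trusted = {w.lower() for w in trusted_words}
    return {word: count for word, count in freqs.items()
            if word.lower() in trusted}


# Counts quoted in the answer; keys lowercased as an assumption.
freqs = {"mañana": 40238, "manana": 1853, "españa": 1356943, "espana": 8297}

# Hypothetical wordlist that you know to be correct.
trusted_words = ["mañana", "España"]

pruned = prune_frequencies(freqs, trusted_words)
print(pruned)

# Assumption to verify against your installed version: I believe the Speller
# constructor accepts custom frequency data, e.g. Speller(lang='es', nlp_data=pruned).
```

Once manana is no longer in the frequency data, it stops being treated as a correct word and becomes eligible for correction to its most frequent one-typo neighbour.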
Upvotes: 1