Reputation: 139
I'm attempting to take large amounts of natural language from a web forum and correct the spelling with PyEnchant. The text is often informal and about medical issues, so I have created a text file, "test.pwl", containing relevant medical words, chat abbreviations, and so on. In some cases, bits of HTML, URLs, etc. unfortunately remain in the text.
My script is designed to use both the en_US dictionary and the PWL to find all misspelled words and automatically correct each one to the first suggestion from d.suggest. It prints a list of misspelled words, then the words that had no suggestions, and writes the corrected text to 'spellfixed.txt':
import enchant
import codecs

def spellcheckfile(filepath):
    d = enchant.DictWithPWL("en_US","test.pwl")
    try:
        f = codecs.open(filepath, "r", "utf-8")
    except IOError:
        print "Error reading the file, right filepath?"
        return
    textdata = f.read()
    mispelled = []
    words = textdata.split()
    for word in words:
        # if spell check failed and the word is also not in
        # mis-spelled list already, then add the word
        if d.check(word) == False and word not in mispelled:
            mispelled.append(word)
    print mispelled
    for mspellword in mispelled:
        #get suggestions
        suggestions=d.suggest(mspellword)
        #make sure we actually got some
        if len(suggestions) > 0:
            # pick the first one
            picksuggestion=suggestions[0]
        else: print mspellword
        #replace every occurence of the bad word with the suggestion
        #this is almost certainly a bad idea :)
        textdata = textdata.replace(mspellword,picksuggestion)
    try:
        fo=open("spellfixed.txt","w")
    except IOError:
        print "Error writing spellfixed.txt to current directory. Who knows why."
        return
    fo.write(textdata.encode("UTF-8"))
    fo.close()
    return
The issue is that the output often contains 'corrections' for words that were in either the dictionary or the PWL. For instance, when the first portion of the input was:
My NEW doctor feels that I am now bi-polar . This , after 9 years of being considered majorly depressed by everyone else
I got this:
My NEW dotor feels that I am now bipolar . This , aftER 9 years of being considERed majorly depressed by evERyone else
I could handle the case changes, but doctor --> dotor is no good at all. When the input is much shorter (for example, when the above quotation is the entire input), the result is what I want:
My NEW doctor feels that I am now bipolar . This , after 9 years of being considered majorly depressed by everyone else
Could anybody explain to me why? In very simple terms, please, as I'm very new to programming and newer to Python. A step-by-step solution would be greatly appreciated.
Upvotes: 1
Views: 2034
Reputation: 9015
#replace every occurence of the bad word with the suggestion
#this is almost certainly a bad idea :)
You were right, that is a bad idea. This is what's causing "considered" to be replaced by "considERed". Also, you're doing the replacement even when you don't find a suggestion, in which case picksuggestion still holds the suggestion picked for the previous word. Move the replacement into the if len(suggestions) > 0 block.
As for replacing every instance of the word: what you want to do instead is save the position of each misspelled word along with its text (or just the positions, and look the words up in the text later when you fetch suggestions), allow duplicates in the list of misspelled words, and replace only the individual word at each position with its suggestion.
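A minimal sketch of that idea, as a starting point rather than a finished fix. It records the start and end offsets of each whitespace-separated token and replaces only those spans, working from the end of the text so that earlier offsets stay valid. The names find_misspelled_spans, apply_corrections, fake_check and fake_suggest are hypothetical; check and suggest stand in for d.check and d.suggest and are passed in as plain functions so the sketch runs without PyEnchant.

```python
import re

def find_misspelled_spans(text, check):
    """Return (start, end, word) for every token that fails check()."""
    spans = []
    for match in re.finditer(r"\S+", text):
        word = match.group()
        if not check(word):
            # Duplicates are kept on purpose: each occurrence has its own span.
            spans.append((match.start(), match.end(), word))
    return spans

def apply_corrections(text, spans, suggest):
    """Replace each recorded span with its first suggestion, if there is one."""
    # Work backwards so that replacing one span does not shift the others.
    for start, end, word in reversed(spans):
        suggestions = suggest(word)
        if suggestions:  # leave the word alone when there is nothing to pick
            text = text[:start] + suggestions[0] + text[end:]
    return text

# Stand-ins for d.check and d.suggest (assumed behaviour, illustration only).
KNOWN = {"considered", "at", "the", "ER"}

def fake_check(word):
    return word in KNOWN

def fake_suggest(word):
    return ["ER"] if word == "er" else []

sample = "considered at the er xyzzy"
spans = find_misspelled_spans(sample, fake_check)
fixed = apply_corrections(sample, spans, fake_suggest)
```

Here "er" is replaced only where it stands as its own token, and "xyzzy", which has no suggestions, is left unchanged.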
I'll leave the implementation details and optimizations up to you, though. A step-by-step solution won't help you learn as much.
Upvotes: 1
Reputation: 5919
I think your problem is that you're replacing letter sequences inside words. "ER" might be a valid spelling correction for "er", but that doesn't mean that you should change "considered" to "considERed".
You can use regexes instead of simple text replacement to ensure that you replace only full words. "\b" in a regex means "word boundary":
>>> "considered at the er".replace( "er", "ER" )
'considERed at the ER'
>>> import re
>>> re.sub( "\\b" + "er" + "\\b", "ER", "considered at the er" )
'considered at the ER'
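Applied to the script in the question, this could look like the sketch below. replace_whole_word is a hypothetical helper, not part of PyEnchant. re.escape is there because forum text can contain characters such as "." or "(" that have a special meaning in a regex. One caveat: "\b" only matches next to a letter, digit or underscore, so a token that begins or ends with punctuation may not be matched by this pattern.

```python
import re

def replace_whole_word(text, bad_word, suggestion):
    """Replace bad_word only where it stands as a whole word."""
    pattern = r"\b" + re.escape(bad_word) + r"\b"
    # Using a function as the replacement keeps any backslashes
    # in the suggestion from being read as regex escapes.
    return re.sub(pattern, lambda match: suggestion, text)

result = replace_whole_word("considered at the er", "er", "ER")
```

In the question's loop, the call textdata.replace(mspellword, picksuggestion) would then become textdata = replace_whole_word(textdata, mspellword, picksuggestion).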
Upvotes: 1