Reputation: 2735
I have some large text files that are in correct English, since they were extracted from PDFs. However, many words in these files are run together, e.g. "informationotherwise", "havebeen", "reportthatexplains". Any spell checker (LanguageTool, Sublime, MS Word) flags these as errors, but I am struggling to detect and fix them in Python.
I tried pyspellchecker and TextBlob to check and correct these words, but, alas, to no avail.
See for example this code, which returns None three times.
from spellchecker import SpellChecker

spell = SpellChecker()
misspelled = spell.unknown(["informationotherwise", "havebeen", "reportthatexplains"])
for word in misspelled:
    print(spell.correction(word))
    print(spell.candidates(word))
And this code:
from textblob import TextBlob

t = "havebeen"
TextBlob(t).correct().string
# Output: 'havebeen'
Any suggestions?
Upvotes: 0
Views: 452
Reputation: 2868
Use the wordninja library to split each run-together string into its component words:
import wordninja

word = ["informationotherwise", "havebeen", "reportthatexplains"]
for x in word:
    print(' '.join(wordninja.split(x)))

# Output:
# information otherwise
# have been
# report that explains
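In case it helps to see why a splitter works here when spell checkers do not: a spell checker looks for known words within a small edit distance, whereas a splitter searches for the cheapest way to cut the string into known words. Below is a rough sketch of that idea using dynamic programming. It is not wordninja's actual code; `VOCAB`, `COST`, `UNKNOWN_BASE` and `split_words` are names made up for illustration, and the seven-word vocabulary is a toy stand-in for a real frequency-ranked word list.

```python
import math

# Hypothetical toy vocabulary, ordered from most to least frequent.
# A real word list would contain many thousands of entries.
VOCAB = ["that", "have", "been", "information", "report", "otherwise", "explains"]

# Assumed cost model: more frequent words (lower rank) are cheaper.
COST = {w: math.log(rank + 2) for rank, w in enumerate(VOCAB)}

# Assumed penalty for pieces that are not in the vocabulary. The large
# base makes a single unknown chunk cheaper than many unknown fragments.
UNKNOWN_BASE = 100.0


def piece_cost(piece):
    """Cost of treating `piece` as one word."""
    return COST.get(piece, UNKNOWN_BASE + 10.0 * len(piece))


def split_words(text):
    """Split `text` into the lowest-cost sequence of pieces."""
    # best[i] = (lowest total cost for text[:i], start index of its last piece)
    best = [(0.0, 0)]
    for i in range(1, len(text) + 1):
        best.append(min((best[j][0] + piece_cost(text[j:i]), j) for j in range(i)))

    # Walk backwards through the stored start indices to recover the pieces.
    pieces = []
    i = len(text)
    while i > 0:
        j = best[i][1]
        pieces.append(text[j:i])
        i = j
    return pieces[::-1]


for s in ["informationotherwise", "havebeen", "reportthatexplains"]:
    print(" ".join(split_words(s)))
```

With this toy vocabulary the loop prints the same three lines as the wordninja example. On real text the quality depends entirely on the word list and its frequency ranking, so the library is the practical choice.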
Upvotes: 4