Martien Lubberink

Reputation: 2735

Python: how to automatically spellcheck and correct joined words such as "reportthatexplains" and "havebeen"

I have some large text files that were extracted from PDFs. The English in them is correct, but many words are joined together: "informationotherwise", "havebeen", "reportthatexplains". Every spell checker I have tried spots these errors, e.g. LanguageTool, Sublime, MS Word. Python, however, struggles.

I tried pyspellchecker and TextBlob to check and correct these words, but, alas, to no avail.

See for example this code, which returns None three times.

from spellchecker import SpellChecker

spell = SpellChecker()
misspelled = spell.unknown(["informationotherwise", "havebeen", "reportthatexplains"])

for word in misspelled:
    print(spell.correction(word))
    print(spell.candidates(word))

And this code:

from textblob import TextBlob

t = "havebeen"
TextBlob(t).correct().string

>>> 'havebeen'

Any suggestions?

Upvotes: 0

Views: 452

Answers (1)

qaiser

Reputation: 2868

Use the wordninja library to split each joined word into its component words:

import wordninja
words = ["informationotherwise", "havebeen", "reportthatexplains"]
for x in words:
    # wordninja.split returns a list of words; join them with spaces
    print(' '.join(wordninja.split(x)))

# Output:
information otherwise
have been
report that explains
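
For intuition about how a splitter like this can work, here is a minimal standard-library sketch of dictionary-based segmentation using dynamic programming. It is not wordninja's actual code: the tiny `VOCAB` list, the rank-based cost formula, and the function name `split_joined` are all assumptions made up for illustration. A real word list would have many thousands of entries ordered by frequency.

```python
import math

# Hypothetical toy vocabulary, ordered from most to least frequent.
VOCAB = ["the", "that", "have", "been", "report",
         "information", "otherwise", "explains"]

# Assumed cost model: the rarer the word (higher rank), the higher its cost.
# Every cost is positive, so segmentations with fewer words are preferred.
COST = {w: math.log(rank + 2) for rank, w in enumerate(VOCAB)}
MAX_LEN = max(len(w) for w in VOCAB)


def split_joined(text):
    """Return the cheapest split of text into VOCAB words, or [text] if none exists."""
    n = len(text)
    # best[i] = (total cost, start index of the last word) for the prefix text[:i]
    best = [(0.0, 0)] + [(math.inf, 0)] * n
    for i in range(1, n + 1):
        for j in range(max(0, i - MAX_LEN), i):
            piece = text[j:i]
            if piece in COST and best[j][0] + COST[piece] < best[i][0]:
                best[i] = (best[j][0] + COST[piece], j)
    if math.isinf(best[n][0]):
        return [text]  # no full segmentation found; leave the token unchanged
    # Walk the stored start indices backwards to recover the words.
    pieces = []
    i = n
    while i > 0:
        j = best[i][1]
        pieces.append(text[j:i])
        i = j
    return pieces[::-1]


for token in ["informationotherwise", "havebeen", "reportthatexplains"]:
    print(' '.join(split_joined(token)))
```

Tokens that cannot be fully covered by the vocabulary come back unchanged, so with this sketch an unknown or out-of-vocabulary word is left as it is instead of being cut into fragments.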

Upvotes: 4
