Evan Mata

Reputation: 612

Splitting a string into relevant words

After using a PDF parser (pdfminer) and a tokenizer (the nltk package), I have a few strings that are really combinations of other words, yet have no punctuation or spacing to split on.

My output has many correct word splittings, but it also occasionally contains items like 'simpleexamplelabeleddatalikelihood', which I would ideally split into 'simple', 'example', 'labeled', 'data', 'likelihood'. I will be operating on a large corpus of documents, so I am likely to get some very odd combinations of words and sentence fragments, and I cannot predict which words will be combined without inspecting the output by hand. Are there any packages that would say "oh, this string is a composite of the words X, Y and Z, so let's split it into X, Y and Z"? If one exists, is it actually accurate? My personal feeling is that this is a semi-hopeless problem, due to issues like the name "Thea" being split into "the" and "a", but perhaps those cases are rare enough that a reasonably accurate package exists?

Upvotes: 0

Views: 120

Answers (1)

Igor

Reputation: 1281

I am not sure to what extent this problem relates to compound splitting (to some extent it certainly does, but it sounds like your input will mostly not be actual compounds). Still, you may look in that direction for answers; perhaps check out https://pypi.org/project/compound-word-splitter/.
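As a rough illustration of what such splitters do internally, here is a minimal dictionary-based dynamic-programming sketch. To be clear, this is not the compound-word-splitter package's API; it is a standalone toy that assumes NLTK's "words" corpus as the vocabulary (any word list could be substituted, and a frequency-weighted one would resolve ambiguities much better):

```python
# Minimal sketch: dictionary-based word segmentation via dynamic programming.
# Assumes NLTK's "words" corpus as the vocabulary; results depend heavily on
# the word list you plug in.
import nltk

nltk.download("words", quiet=True)  # fetch the word list if not present
from nltk.corpus import words

VOCAB = {w.lower() for w in words.words()}

def split_concatenated(s, max_word_len=20):
    """Split s into dictionary words, preferring segmentations with
    fewer (hence longer) words, so 'data' beats 'da' + 'ta'."""
    n = len(s)
    best = [None] * (n + 1)  # best[i] = best segmentation of s[:i]
    best[0] = []
    for i in range(1, n + 1):
        for j in range(max(0, i - max_word_len), i):
            if best[j] is not None and s[j:i] in VOCAB:
                candidate = best[j] + [s[j:i]]
                if best[i] is None or len(candidate) < len(best[i]):
                    best[i] = candidate
    return best[n]  # None when no full segmentation exists

print(split_concatenated("simpleexamplelabeleddatalikelihood"))
# Hoped-for output: ['simple', 'example', 'labeled', 'data', 'likelihood'],
# though the exact result depends on the word list.
```

For a ready-made, frequency-based version of this idea, the wordsegment and wordninja packages on PyPI segment strings using word-frequency statistics rather than a bare word list, which tends to handle ambiguous cases like your "Thea" example more gracefully (though not perfectly).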

Upvotes: 0
