Evan Mata

Reputation: 612

Splitting a string into relevant words

After using a PDF parser (pdfminer) and a tokenizer (the nltk package), I have a few strings that are really combinations of other words, yet have no punctuation or spacing to split on.

My output has many correct word splittings, but it also occasionally contains items like 'simpleexamplelabeleddatalikelihood', which I would ideally split into 'simple', 'example', 'labeled', 'data', 'likelihood'. I will be operating on a large corpus of documents, so I am likely to get some very odd combinations of words and sentence fragments, and I cannot predict which words will be combined without inspecting the output by hand. Are there any packages that would say "oh, this string is a composite of the words X, Y and Z, so let's split it into X, Y and Z"? If one exists, is it actually accurate? My personal feeling is that this is a semi-hopeless problem, due to issues like the name "Thea" being split into "the" and "a", but perhaps those cases are rare enough that a reasonably accurate package exists?

Upvotes: 0

Views: 120

Answers (1)

Igor

Reputation: 1281

I am not sure to what extent this problem relates to compound splitting (to some extent it certainly does, but it sounds like your input will mostly not be actual compounds). Still, you may look in that direction for answers; perhaps check out https://pypi.org/project/compound-word-splitter/.
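As a rough illustration of what such splitters do internally, here is a minimal dictionary-based dynamic-programming sketch. To be clear, this is not the compound-word-splitter package's API; it is a standalone toy that assumes NLTK's "words" corpus as the vocabulary (any word list could be substituted, and a frequency-weighted one would resolve ambiguities much better):

```python
# Minimal sketch: dictionary-based word segmentation via dynamic programming.
# Assumes NLTK's "words" corpus as the vocabulary; results depend heavily on
# the word list you plug in.
import nltk

nltk.download("words", quiet=True)  # fetch the word list if not present
from nltk.corpus import words

VOCAB = {w.lower() for w in words.words()}

def split_concatenated(s, max_word_len=20):
    """Split s into dictionary words, preferring segmentations with
    fewer (hence longer) words, so 'data' beats 'da' + 'ta'."""
    n = len(s)
    best = [None] * (n + 1)  # best[i] = best segmentation of s[:i]
    best[0] = []
    for i in range(1, n + 1):
        for j in range(max(0, i - max_word_len), i):
            if best[j] is not None and s[j:i] in VOCAB:
                candidate = best[j] + [s[j:i]]
                if best[i] is None or len(candidate) < len(best[i]):
                    best[i] = candidate
    return best[n]  # None when no full segmentation exists

print(split_concatenated("simpleexamplelabeleddatalikelihood"))
# Hoped-for output: ['simple', 'example', 'labeled', 'data', 'likelihood'],
# though the exact result depends on the word list.
```

For a ready-made, frequency-based version of this idea, the wordsegment and wordninja packages on PyPI segment strings using word-frequency statistics rather than a bare word list, which tends to handle ambiguous cases like your "Thea" example more gracefully (though not perfectly).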

Upvotes: 0
