Python Slate Library: PDF text extraction concatenating words

Question

I'm trying to extract the text from a PDF in Python using the Slate library and PyPDF2. Unfortunately, some PDFs come out with multiple words merged together. It happens intermittently: for some PDFs the words are extracted with the spaces between them intact, whereas for others they are not.

One example of a PDF where the words are not extracted correctly is available for download here (SO wouldn't let me upload it). The output from

import slate
slate.PDF(open(name, 'rb')).text()

is (or at least a segment is):

,notonadhocprocedures,andcanbeusedwithdatacollectedatmul-tiplespatialresolutions(Kulldorff1999).Ifdataontheabundanceofataxonovertimeareavailable,thesedatacanbeincorporatedintoanSTPSanalysistoincreasethesensitivityandreliabilityofthemodeltodetectsightingclusters,

where, of course, the first comma-separated token should read "not on ad hoc procedures".
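
Since only some files are affected, a rough check like the one below could flag which extractions came out merged. It is only a sketch: the function names and the 40-character threshold are my own assumptions, not part of Slate or PyPDF2, and it just looks for implausibly long tokens in the extracted string.

```python
def longest_token_length(text):
    """Return the length of the longest whitespace-separated token."""
    # The "or [0]" fallback handles empty text, where split() yields no tokens.
    return max([len(tok) for tok in text.split()] or [0])


def looks_merged(text, max_word_len=40):
    """Guess whether extracted text has lost its spaces.

    max_word_len is an assumed threshold: ordinary English words are
    far shorter, so a longer token suggests several words run together.
    """
    return longest_token_length(text) > max_word_len


# A piece of the bad output quoted above, and a correctly spaced phrase.
sample = (",notonadhocprocedures,andcanbeusedwithdatacollectedatmul-"
          "tiplespatialresolutions(Kulldorff1999).")
print(looks_merged(sample))
print(looks_merged("not on ad hoc procedures, and can be used with data"))
```

Running this over the text of each PDF would at least separate the files that extract cleanly from the ones that do not.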

Does anybody know why this is happening, or have a better idea of a library to use for PDF text extraction?


Answers (0)
