Reputation: 1503
I'm trying to extract the text from a PDF in Python, using the slate library and PyPDF2. Unfortunately, some PDFs come out with multiple words merged together. This happens intermittently: for some PDFs the words are extracted with the spaces between them intact, whereas for others they are not.
One example of a PDF whose words are not extracted correctly is available for download here (SO wouldn't let me upload it). The output from
slate.PDF(open(name, 'rb')).text()
is (or at least a segment of it is):
,notonadhocprocedures,andcanbeusedwithdatacollectedatmul-tiplespatialresolutions(Kulldorff1999).Ifdataontheabundanceofataxonovertimeareavailable,thesedatacanbeincorporatedintoanSTPSanalysistoincreasethesensitivityandreliabilityofthemodeltodetectsightingclusters,
where, of course, the first comma-separated token should read "not on ad hoc procedures".
Does anybody know why this is happening, or know of a better library for PDF text extraction?
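Since the problem only affects some PDFs, a rough check like the one below could flag which outputs have merged words. This is only a heuristic sketch: the function name `looks_merged` and the 40-character threshold are my own assumptions, not part of slate or PyPDF2, and long URLs or identifiers in a document could trigger false positives.

```python
def looks_merged(text, max_token_len=40):
    """Heuristic: return True if any whitespace-delimited token is
    longer than max_token_len characters (an arbitrary threshold)."""
    tokens = text.split()
    if not tokens:
        # Empty or whitespace-only output: nothing to judge.
        return False
    return max(len(token) for token in tokens) > max_token_len


# Segment of the merged output shown above.
sample = (",notonadhocprocedures,andcanbeusedwithdatacollectedatmul-"
          "tiplespatialresolutions(Kulldorff1999).")
print(looks_merged(sample))
print(looks_merged("not on ad hoc procedures, and can be used with data"))
```

The same function could be applied to the string returned by the slate call above to sort PDFs into affected and unaffected groups.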
Upvotes: 1
Views: 897