Reputation: 1217
I have the following text in a PDF:
STUDENT ________JOHN______
DATE ______MM/DD/AAA______ (date)
COURSE ___________________ PROFESSOR ___________
When I use PDFMiner to extract the text, I get the following:
STUDENT ____
DATE MM/DD/AAA
(date)
JOHN
COURSE
___________________
PROFESSOR
___________
How can I get the correct output using PDFMiner (or another Python library)?
Upvotes: 2
Views: 1683
Reputation: 11
The best way to do that is to extract the PDF as HTML using pdfminer's HTMLConverter. A typical command is:
pdf2txt.py -t html -o outputFilePath/outputFileName.txt YourPDFpath/PDFname.pdf
Further processing can run into encoding problems, so it is better to set the encoding explicitly, to either utf-8 or cp1252. Example:
pdf2txt.py -t html -c cp1252 -o outputFilePath/outputFileName.txt YourPDFpath/PDFname.pdf
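As for the "further processing" step, here is a rough sketch of one way it could go, not something pdfminer provides. It assumes you have already pulled each text fragment out of the HTML together with its top and left pixel offsets; the (top, left, text) tuple layout, the rebuild_lines name, the tolerance value and the coordinates in the example are all made up for illustration. The idea is just to group fragments that sit at roughly the same height and read each group left to right.

```python
from typing import List, Tuple

# Hypothetical fragment layout: (top, left, text), offsets in pixels.
Fragment = Tuple[float, float, str]


def rebuild_lines(fragments: List[Fragment], tolerance: float = 3.0) -> List[str]:
    """Group fragments with nearly equal `top` into rows, then join each row left to right."""
    rows: List[Tuple[float, List[Fragment]]] = []
    for frag in sorted(fragments, key=lambda f: (f[0], f[1])):
        # Same row if this fragment's top is within `tolerance` of the row's first fragment.
        if rows and abs(frag[0] - rows[-1][0]) <= tolerance:
            rows[-1][1].append(frag)
        else:
            rows.append((frag[0], [frag]))
    return [
        " ".join(text for _, _, text in sorted(row, key=lambda f: f[1]))
        for _, row in rows
    ]


# Made-up coordinates for the fragments from the question.
example = [
    (100.0, 72.0, "STUDENT ____"),
    (120.0, 72.0, "DATE MM/DD/AAA"),
    (120.5, 200.0, "(date)"),
    (101.0, 150.0, "JOHN"),
    (140.0, 72.0, "COURSE"),
    (140.0, 130.0, "___________________"),
    (140.0, 300.0, "PROFESSOR"),
    (140.0, 380.0, "___________"),
]

for line in rebuild_lines(example):
    print(line)
```

The tolerance is there because fragments on the same visual line do not always share exactly the same top offset; you would probably need to tune it for your font size.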
Upvotes: 1