Reputation: 1217

PDFMiner: extract text from a PDF without mixing up the order

I have the following text in a PDF:

STUDENT ________JOHN______
DATE ______MM/DD/AAA______ (date)
COURSE ___________________ PROFESSOR ___________

When I use PDFMiner to extract the text, I get the following:

STUDENT ____
DATE MM/DD/AAA
(date)
JOHN
COURSE 
___________________ 
PROFESSOR 
___________

How can I get the correct output using PDFMiner (or another Python library)?

Upvotes: 2

Answers (1)

Reputation: 11

The best way to do that is to extract the PDF as HTML using pdfminer's HTMLConverter, which keeps the position of each piece of text on the page. A typical command is:

pdf2txt.py -t html -o outputFilePath/outputFileName.txt YourPDFpath/PDFname.pdf

Further processing can run into encoding problems, so it is better to set the encoding explicitly, to either utf-8 or cp1252. Example:

pdf2txt.py -t html -c cp1252 -o outputFilePath/outputFileName.txt YourPDFpath/PDFname.pdf
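Once you have the HTML, the reading order can be rebuilt from the positions. As far as I know, the HTML output carries left/top pixel offsets in the style attribute of its elements. Below is a minimal sketch of the regrouping step only, assuming you have already pulled (left, top, text) triples out of the HTML yourself. The coordinates are made up for illustration, and `rebuild_lines` and `tolerance` are hypothetical names, not part of pdfminer.

```python
from typing import List, Tuple

# (left, top, text): a hypothetical shape for fragments taken from the HTML.
Fragment = Tuple[float, float, str]


def rebuild_lines(fragments: List[Fragment], tolerance: float = 3.0) -> List[str]:
    """Group fragments with nearly equal top offsets into one line,
    then order each line from left to right."""
    lines: List[List[Fragment]] = []
    # Sort top to bottom so fragments on the same row end up adjacent.
    for frag in sorted(fragments, key=lambda f: f[1]):
        # Compare against the top offset of the first fragment in the current line.
        if lines and abs(frag[1] - lines[-1][0][1]) <= tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])
    # Tuples sort by their first item, i.e. the left offset.
    return [" ".join(text for _, _, text in sorted(line)) for line in lines]


# Made-up coordinates that mimic the form in the question.
fragments = [
    (72, 100, "STUDENT ____"),
    (72, 120, "DATE MM/DD/AAA"),
    (260, 121, "(date)"),
    (150, 99, "JOHN"),
    (72, 140, "COURSE"),
    (130, 141, "___________________"),
    (300, 140, "PROFESSOR"),
    (380, 141, "___________"),
]

lines = rebuild_lines(fragments)
for line in lines:
    print(line)
```

The tolerance is there because text that sits on the same visual row (like JOHN written over the underscores) is often a pixel or two off vertically. You will probably have to tune it for your own PDF.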

Upvotes: 1