PythonSherpa

Reputation: 2600

PyPDF2 returns only empty lines for some files

I am working on a script that "reads" PDF files and then automatically renames the files it recognizes from a dictionary. However, PyPDF2 returns only empty lines for some PDFs, while working fine for others. The code for reading the files:

import PyPDF2

# File name
file = 'sample.pdf'

# Open File
with open(file, "rb") as f:
    # Read in file
    pdfReader = PyPDF2.PdfFileReader(f)

    # Check number of pages
    number_of_pages = pdfReader.numPages
    print(number_of_pages)

    # Get first page
    pageObj = pdfReader.getPage(0)

    # Extract text from page 1
    text = pageObj.extractText()        

print(text)

It does get the number of pages correctly, so it is able to open the PDF.

If I replace print(text) with print(repr(text)) for the files it doesn't read, I get something like:

"'\\n\\n\\n\\n\\n\\n\\n\\nn\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n'"
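A quick way to tell these files apart from the ones that work is to treat an extraction as empty when nothing remains after stripping whitespace. This is only a minimal sketch; the helper names `is_blank` and `blank_ratio` are hypothetical and not part of PyPDF2.

```python
def is_blank(text):
    """Return True when the extracted text holds nothing but whitespace."""
    return not text.strip()


def blank_ratio(texts):
    """Return the fraction of extracted strings that are whitespace-only."""
    texts = list(texts)
    if not texts:
        return 0.0
    return sum(is_blank(t) for t in texts) / len(texts)
```

Running `is_blank` on the text of each file gives a list of the PDFs that need a different extraction approach.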

Weirdly enough, when I enhance (OCR) the files with Adobe, the script performs slightly worse. It recognized 140 out of 800 files before enhancing and only 110 afterwards.

The PDFs are machine-readable/searchable, because I am able to copy and paste text into Notepad. I tested some files with "pdfminer", which does show some text but also throws a lot of errors. If possible, I would like to keep working with PyPDF2.

Specifications of the software I am using:
Windows: 10.0.15063
Python: 3.6.1
PyPDF2: 1.26.0
Adobe version: 17.009.20058

Does anyone have any suggestions? Your help is very much appreciated!

Upvotes: 2

Views: 1620

Answers (1)

vikalp rusia

Reputation: 97

I had the same issue and fixed it using another Python library called slate. Fortunately, I found a fork (slate3k) that works in Python 3.6.5.

import slate3k as slate

# Open the PDF in binary mode; the file name must be passed as a string
with open('file.pdf', 'rb') as f:
    extracted_text = slate.PDF(f)
print(extracted_text)
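If you would rather keep PyPDF2 as the first choice and only switch libraries for the files it cannot read, one option is to try several extractors in order and keep the first result that is not blank. The sketch below assumes each extractor is a plain function that takes a file path and returns a string; the name `first_nonblank` is hypothetical and belongs to neither library.

```python
def first_nonblank(path, extractors):
    """Try each extractor in order and return the first non-blank text.

    `extractors` is a list of callables that take a file path and return
    a string. An extractor that raises is skipped so the next one can run.
    Returns an empty string when every extractor fails or yields only
    whitespace.
    """
    for extract in extractors:
        try:
            text = extract(path)
        except Exception:
            # A failing library should not stop the remaining attempts
            continue
        if text and text.strip():
            return text
    return ""
```

Wrapping the PyPDF2 code from the question in one function and the slate call in another, then passing both as `extractors`, would leave PyPDF2 in charge wherever it already works.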

Upvotes: 1
