Reputation: 2600
I am working on a script that "reads" PDF files and and then automatically renames the files it recognizes from a dictionary. PyPDF2 however only returns empty lines for some PDFs, while working fine for others. The code for reading the files:
import PyPDF2
# File name
file = 'sample.pdf'
# Open File
with open(file, "rb") as f:
# Read in file
pdfReader = PyPDF2.PdfFileReader(f)
# Check number of pages
number_of_pages = pdfReader.numPages
print(number_of_pages)
# Get first page
pageObj = pdfReader.getPage(0)
# Extract text from page 1
text = pageObj.extractText()
print(text)
It does get the number of pages correctly, so it is able to open the PDF.
If I replace the print(text) by repr(text) for files it doesn't read, I get something like:
"'\\n\\n\\n\\n\\n\\n\\n\\nn\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n'"
Weirdly enough, when I enhance (OCR) the files with Adobe, the script performs slightly worse. It Recognized 140 out of 800 files and after enhancing just 110.
The PDFs are machine readable/searchable, because I am able to copy/paste text to Notepad. I tested some files with "pdfminer" and it does show some text, but also throws in a lot of errors. If possible I like to keep working with PyPDF2.
Specifications of the software I am using:
Windows: 10.0.15063
Python: 3.6.1
PyPDF: 1.26.0
Adobe version: 17.009.20058
Anyone any suggestions? Your help is very much appreciated!
Upvotes: 2
Views: 1620
Reputation: 97
I had the same issue, i fixed it using another python library called slate Fortunately, i found a fork that works in Python 3.6.5
import slate3k as slate
with open(file.pdf,'rb') as f:
extracted_text = slate.PDF(f)
print(extracted_text)
Upvotes: 1