Reputation: 11
I am looping through a directory and reading in numerous PDFs. I am extracting all text information from each page using a loop.
5 of the 13 PDFs throw an error when I call .getNumPages(): Exception has occurred: ValueError: invalid literal for int() with base 10: b''. I believe the error occurs because the PyPDF2 reader object shows numPages: 0 for those files.
import os
import PyPDF2

dir = os.listdir(directory)
for f in dir:
    object = PyPDF2.PdfFileReader(directory + '\\' + f)
    NumPages = object.getNumPages()
    text_output = ""  # Initiate Variable
    # Loop through all pages and extract/merge text
    with open(directory + '\\' + f, mode='rb') as FileName:
        reader = PyPDF2.PdfFileReader(FileName)
        for p_num in range(0, NumPages):
            page = reader.getPage(p_num)
            text_output = text_output + '\n' + 'PAGE: ' + \
                str(p_num + 1) + '\n' + page.extractText()
I added an image showing the object data where numPages: 0
I cannot figure out why only certain PDFs are having this issue. Any help would be appreciated!!
Upvotes: 1
Views: 3590
Reputation: 1
import PyPDF2

# PdfReader and .pages replace the older PdfFileReader / getNumPages() API
a = PyPDF2.PdfReader('Check.pdf')
print(len(a.pages))
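To find out which of the files in the directory are the problematic ones, the same page count can be wrapped in a try/except so that one bad PDF does not stop the whole loop. This is only a minimal sketch: the names count_pages and scan_directory are made up for illustration, and it assumes PyPDF2 2.x or later (where PdfReader and .pages are available).

```python
import os


def count_pages(path):
    """Return the page count of one PDF via PyPDF2.PdfReader (PyPDF2 2.x+ assumed)."""
    import PyPDF2  # imported lazily so scan_directory itself has no hard dependency

    with open(path, "rb") as fh:
        return len(PyPDF2.PdfReader(fh).pages)


def scan_directory(directory, page_counter=count_pages):
    """Return (counts, failures): page counts per file, and errors per failing file."""
    counts, failures = {}, {}
    for name in sorted(os.listdir(directory)):
        if not name.lower().endswith(".pdf"):
            continue  # skip anything that is not a PDF
        path = os.path.join(directory, name)
        try:
            counts[name] = page_counter(path)
        except Exception as exc:  # broad on purpose: malformed PDFs raise varied errors
            failures[name] = repr(exc)
    return counts, failures
```

Calling counts, failures = scan_directory(directory) and printing failures should list the PDFs that raise the ValueError, so they can be inspected or handed to a different library.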
Upvotes: 0
Reputation: 567
I have tested a few PDF libraries and found that PyMuPDF is the best at reading PDF files.
Here is a code example:
import fitz  # PyMuPDF

doc = fitz.open("file.pdf")
for page in doc:
    # get_text() is the current name; older releases spelled it getText()
    text = page.get_text()
    print(text)
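If you want the merged output with page headers like in the question, the per-page strings can be joined by a small helper that does not care which library produced them. This is a rough sketch; merge_pages is a made-up name, and unlike the question's loop it does not put a newline before the first header.

```python
def merge_pages(page_texts):
    """Join per-page text into one string, each page preceded by a 'PAGE: n' header."""
    parts = []
    for number, text in enumerate(page_texts, start=1):
        parts.append(f"PAGE: {number}\n{text}")
    return "\n".join(parts)
```

With PyMuPDF this could be called as merge_pages(page.get_text() for page in doc), assuming a recent release where the method is named get_text().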
Upvotes: 1
Reputation: 528
I use pdfminer to extract text from PDFs.
You can refer to the example code below.
# pip install pdfminer.six
import io

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage


def convert_pdf_to_txt(path):
    '''Convert pdf content from a file path to text

    :path the file path
    '''
    rsrcmgr = PDFResourceManager()
    codec = 'utf-8'
    laparams = LAParams()

    with io.StringIO() as retstr:
        with TextConverter(rsrcmgr, retstr, codec=codec,
                           laparams=laparams) as device:
            with open(path, 'rb') as fp:
                interpreter = PDFPageInterpreter(rsrcmgr, device)
                password = ""
                maxpages = 0
                caching = True
                pagenos = set()

                for page in PDFPage.get_pages(fp,
                                              pagenos,
                                              maxpages=maxpages,
                                              password=password,
                                              caching=caching,
                                              check_extractable=True):
                    interpreter.process_page(page)

        return retstr.getvalue()


if __name__ == "__main__":
    print(convert_pdf_to_txt('test.pdf'))
For more information about this library, you can refer to the link below.
Please try it and let me know if you run into any issues.
Upvotes: 0
Reputation: 1746
I also faced the same issues with PyPDF2, so I used another Python library named slate.
Install the library
pip install slate3k
Then use the below code
import slate3k as slate

# The file name must be a string; slate.PDF returns the text of each page
with open('file.pdf', 'rb') as f:
    extracted_text = slate.PDF(f)
print(extracted_text)
Upvotes: 0