alexlong1117

Reputation: 11

PyPDF2 - PdfFileReader - cannot extract text

I am looping through a directory and reading in numerous PDFs. I am extracting all text information from each page using a loop.

5 of the 13 PDFs raise an error when I call .getNumPages(): Exception has occurred: ValueError invalid literal for int() with base 10: b''. I believe the error occurs because the PyPDF2 reader object shows numPages: 0 for these files.

Current Code

import os
import PyPDF2

dir = os.listdir(directory)

for f in dir:
    object = PyPDF2.PdfFileReader(directory + '\\' + f)

    NumPages = object.getNumPages()  # raises the ValueError for 5 of the files
    text_output = ""  # Initiate Variable

    # Loop through all pages and extract/merge text
    with open(directory + '\\' + f, mode='rb') as FileName:
        reader = PyPDF2.PdfFileReader(FileName)
        for p_num in range(0, NumPages):
            page = reader.getPage(p_num)
            text_output = text_output + '\n' + 'PAGE: ' + \
                str(p_num + 1) + '\n' + page.extractText()

I added an image showing the object data where numPages: 0

I cannot figure out why only certain PDFs are having this issue. Any help would be appreciated!!

Upvotes: 1

Views: 3590

Answers (4)

Saurav kumar

Reputation: 1

import PyPDF2

# PdfReader and .pages replace PdfFileReader and getNumPages() in newer PyPDF2 releases
a = PyPDF2.PdfReader('Check.pdf')
print(len(a.pages))
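
To apply this to the directory loop from the question, here is a minimal sketch that assumes a PyPDF2 release providing PdfReader, .pages and extract_text(). The helper names read_pdf_text and extract_directory are hypothetical, not part of PyPDF2. Catching the exception per file will not repair the 5 problem PDFs, but it lets the other files finish and records which ones fail and why:

```python
import os


def read_pdf_text(path):
    """Return the text of every page in one PDF, labelled by page number."""
    import PyPDF2  # imported lazily so extract_directory can be used without it

    with open(path, "rb") as fh:
        reader = PyPDF2.PdfReader(fh)
        parts = []
        for p_num, page in enumerate(reader.pages, start=1):
            parts.append(f"PAGE: {p_num}\n{page.extract_text()}")
    return "\n".join(parts)


def extract_directory(directory, extract=read_pdf_text):
    """Run `extract` on each .pdf in `directory`; collect failures instead of stopping."""
    texts, failures = {}, {}
    for name in sorted(os.listdir(directory)):
        if not name.lower().endswith(".pdf"):
            continue
        path = os.path.join(directory, name)
        try:
            texts[name] = extract(path)
        except Exception as exc:  # broad on purpose: malformed PDFs raise varied errors
            failures[name] = repr(exc)
    return texts, failures
```

Calling texts, failures = extract_directory(directory) returns the extracted text keyed by file name, plus a dict of the files that raised. The extract argument is there so the same loop can be retried on the failing files with a different library.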

Upvotes: 0

marcin

Reputation: 567

I have tested a few PDF libraries, and in my experience PyMuPDF is the best at reading PDF files.

Here is a code example:

import fitz

doc = fitz.open("file.pdf")

for page in doc:
    # get_text() is the current name; older PyMuPDF releases used getText()
    text = page.get_text()
    print(text)

Upvotes: 1

Thuấn Đào Minh

Reputation: 528

I use pdfminer to extract text from PDFs.

Here is some example code:

# pip install pdfminer.six
import io

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage


def convert_pdf_to_txt(path):
    '''Convert pdf content from a file path to text

    :path the file path
    '''
    rsrcmgr = PDFResourceManager()
    codec = 'utf-8'
    laparams = LAParams()

    with io.StringIO() as retstr:
        with TextConverter(rsrcmgr, retstr, codec=codec,
                           laparams=laparams) as device:
            with open(path, 'rb') as fp:
                interpreter = PDFPageInterpreter(rsrcmgr, device)
                password = ""
                maxpages = 0
                caching = True
                pagenos = set()

                for page in PDFPage.get_pages(fp,
                                              pagenos,
                                              maxpages=maxpages,
                                              password=password,
                                              caching=caching,
                                              check_extractable=True):
                    interpreter.process_page(page)

                return retstr.getvalue()


if __name__ == "__main__":
    print(convert_pdf_to_txt('test.pdf'))

For more information about this library, see the link below:

PDFminer

Give it a try and let me know if you run into any issues.

Upvotes: 0

coderina

Reputation: 1746

I ran into the same issues with PyPDF2, so I switched to another Python library called slate.

  • Install the library

    pip install slate3k
    
  • Then use the below code

    import slate3k as slate
    
    with open('file.pdf', 'rb') as f:
        extracted_text = slate.PDF(f)  # list-like: one string per page
    print(extracted_text)

Upvotes: 0
