Reputation: 23

PDFPage does not exist in Python PDFMiner library

I pip-installed pdfminer3k for Python 3.6. I was trying to follow some examples of opening PDF files and converting them to text, and they all require a PDFPage import, which does not exist for me. Is there a workaround for this? I tried copying a PDFPage.py from online into the directory where Python looks for pdfminer, but I just got "ImportError: cannot import name PDFObjectNotFound".

Thanks!

Upvotes: 1

Answers (2)

Thomas Plocher

Reputation: 1

import io
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfparser import PDFParser, PDFDocument

def extract_text_from_pdf(pdf_path):
    '''
    Generator: extract the plain text from a PDF file with pdfminer3k.

    pdf_path: path to the PDF file to extract
    yields: the extracted text, one string per page
    '''
    # pdfminer3k has no PDFPage class; pages come from PDFDocument.get_pages().
    # A pdfminer.six version can be found at:
    # https://www.blog.pythonlibrary.org/2018/05/03/exporting-data-from-pdfs-with-python/
    with open(pdf_path, 'rb') as fp:
        parser = PDFParser(fp)
        doc = PDFDocument()
        parser.set_document(doc)
        doc.set_parser(parser)
        doc.initialize('')
        # pdfminer.six equivalent: PDFPage.get_pages(fp, caching=True, check_extractable=True)
        for page in doc.get_pages():
            rsrcmgr = PDFResourceManager()
            fake_file_handle = io.StringIO()
            device = TextConverter(rsrcmgr, fake_file_handle, laparams=LAParams())
            interpreter = PDFPageInterpreter(rsrcmgr, device)
            try:
                interpreter.process_page(page)
                text = fake_file_handle.getvalue()
            finally:
                # close the per-page handles before yielding, so they are
                # released even if the caller stops iterating early
                device.close()
                fake_file_handle.close()
            yield text

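# Example usage; replace the placeholder with the path to your own PDF.
fPath = 'example.pdf'  # hypothetical path, not from the original post
maxPages = 1
for i, t in enumerate(extract_text_from_pdf(fPath)):
    if i < maxPages:
        print(f"Page {i}:\n{t}")
    else:
        # the page has still been parsed; only the printing is skipped
        print(f"Page {i} skipped!")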
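
For comparison, here is a rough sketch of what the pdfminer.six variant mentioned in the comments above might look like. It assumes pdfminer.six is installed in place of pdfminer3k (both install a package named pdfminer, so they are best not mixed in one environment). The function name extract_text_six is made up for illustration, and the imports sit inside the function so the definition itself loads even where pdfminer.six is absent.

```python
import io


def extract_text_six(pdf_path):
    """Yield the text of each page. Sketch only; assumes pdfminer.six."""
    # Deferred imports: these modules come from pdfminer.six, not pdfminer3k.
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    rsrcmgr = PDFResourceManager()
    with open(pdf_path, 'rb') as fh:
        # PDFPage.get_pages takes over the role of doc.get_pages() in pdfminer3k.
        for page in PDFPage.get_pages(fh, caching=True, check_extractable=True):
            buf = io.StringIO()
            device = TextConverter(rsrcmgr, buf, laparams=LAParams())
            interpreter = PDFPageInterpreter(rsrcmgr, device)
            try:
                interpreter.process_page(page)
                text = buf.getvalue()
            finally:
                # release the per-page handles before handing the text back
                device.close()
                buf.close()
            yield text
```

It is consumed the same way as extract_text_from_pdf above: iterate over the generator and each item is one page of text.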

Upvotes: 0

Angelo Niforatos

Reputation: 23

Ah, I guess PDFPage is not part of pdfminer3k, the package I installed for Python 3.6. Following the example from "How to read pdf file using pdfminer3k?" solved my issue!
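
If it is unclear which flavor of pdfminer is installed, a small check like the one below may help. It is only a sketch: it assumes that a pdfminer.pdfpage module signals a pdfminer.six-style install and that its absence signals pdfminer3k, and the helper names has_module and detect_pdfminer_flavor are invented for this example.

```python
import importlib.util


def has_module(name):
    """Return True if the module `name` can be located."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # find_spec raises when a parent package (e.g. pdfminer) is missing
        return False


def detect_pdfminer_flavor():
    """Guess which pdfminer API is importable (heuristic, not authoritative)."""
    if not has_module("pdfminer"):
        return "pdfminer is not installed"
    if has_module("pdfminer.pdfpage"):
        return "PDFPage is available (pdfminer.six-style API)"
    return "no PDFPage module (pdfminer3k-style API)"


print(detect_pdfminer_flavor())
```

With the pdfminer3k-style result, pages come from PDFDocument.get_pages() as in the other answer; with the pdfminer.six-style result, the PDFPage-based examples should import cleanly.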

Upvotes: 1
