Reputation: 121
In pypdf, I can get the total number of pages of a PDF file via:
from pypdf import PdfReader
reader = PdfReader("example.pdf")
no_of_pages = len(reader.pages)
How can I get this using PDFMiner?
Upvotes: 9
Views: 22200
Reputation: 136347
I realize you were asking for PDFMiner. However, people coming via Google Search to this question might also be interested in alternatives to PDFMiner.
pypdf is a pure-Python alternative:
from pypdf import PdfReader
reader = PdfReader("example.pdf")
pdf_page_count = len(reader.pages)
pikepdf is another alternative:
from pikepdf import Pdf
pdf_doc = Pdf.open('fourpages.pdf')
pdf_page_count = len(pdf_doc.pages)
Upvotes: 5
Reputation: 4785
I found PDFMiner very slow in getting the total number of pages. I found this to be a cleaner and faster solution:
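pip3 install PyPDF2
# Note: PyPDF2 3.0 removed PdfFileReader and getNumPages();
# on that version use PdfReader and len(reader.pages) instead.
from PyPDF2 import PdfFileReader
def get_pdf_page_count(path):
    with open(path, 'rb') as fl:
        reader = PdfFileReader(fl)
        return reader.getNumPages()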
Upvotes: 0
Reputation: 10871
I hate to just leave a code snippet. For context, here is a link to the current pdfminer.six repo where you might be able to learn a little more about the resolve1 method.
As you're working with PDFMiner, you might print and come across some PDFObjRef objects. Essentially, you can use resolve1 to expand those objects (they're usually a dictionary).
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import resolve1
# Open in binary mode; the with-block closes the file afterwards
with open('some_file.pdf', 'rb') as file:
    parser = PDFParser(file)
    document = PDFDocument(parser)
    # catalog['Pages'] is an indirect reference; resolve1 expands it to a dict
    # This will give you the count of pages
    print(resolve1(document.catalog['Pages'])['Count'])
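Not every PDF is well formed, so the 'Count' entry may be missing or may not be an integer. The following is a minimal sketch of a more defensive variant, not an official pdfminer.six recipe: the helper names page_count_from_catalog and get_pdf_page_count are made up for illustration, and falling back to iterating over PDFPage.create_pages() is an assumption about what you would want in that case.

```python
def page_count_from_catalog(catalog, resolve):
    """Return /Count of the root page-tree node, or None if it is unusable.

    `catalog` is a dict-like document catalog and `resolve` is a function
    that expands indirect references (resolve1 when used with pdfminer.six).
    """
    pages = resolve(catalog.get("Pages"))
    if not isinstance(pages, dict):
        return None
    count = resolve(pages.get("Count"))
    return count if isinstance(count, int) else None


def get_pdf_page_count(path):
    """Hypothetical wrapper: read /Count, else count pages one by one."""
    # Imported lazily so the helper above works without pdfminer.six installed.
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfinterp import resolve1
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfparser import PDFParser

    with open(path, "rb") as fp:
        document = PDFDocument(PDFParser(fp))
        count = page_count_from_catalog(document.catalog, resolve1)
        if count is None:
            # Fallback (assumption): walk the page tree and count the pages.
            count = sum(1 for _ in PDFPage.create_pages(document))
        return count
```

Calling get_pdf_page_count('some_file.pdf') would then return the count as an integer instead of printing it.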
Upvotes: 27
Reputation: 89
Using pdfminer.six, you just need to import the high-level function extract_pages, convert the generator into a list, and take its length.
from pdfminer.high_level import extract_pages
pdf_file = 'example.pdf'  # placeholder path; point this at your own PDF
print(len(list(extract_pages(pdf_file))))
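Converting the generator into a list keeps every analyzed page layout in memory just to count them. A minimal sketch of counting the items one at a time instead; count_items and count_pdf_pages are hypothetical helper names, not part of pdfminer.six.

```python
def count_items(iterable):
    """Count the items of any iterable without storing them."""
    return sum(1 for _ in iterable)


def count_pdf_pages(pdf_file):
    """Hypothetical wrapper: count the pages yielded by extract_pages."""
    # Imported lazily so count_items can be used without pdfminer.six installed.
    from pdfminer.high_level import extract_pages

    return count_items(extract_pages(pdf_file))
```

The pages are still laid out one by one, so this mainly saves memory rather than time.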
Upvotes: 8
Reputation: 1912
Using pdfminer, import the necessary modules.
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
Create a PDF parser object associated with the file object.
fp = open('your_file.pdf', 'rb')
parser = PDFParser(fp)
Create a PDF document object that stores the document structure.
document = PDFDocument(parser)
Iterate over the pages yielded by PDFPage.create_pages(), incrementing a counter for each page.
num_pages = 0
for page in PDFPage.create_pages(document):
    num_pages += 1
print(num_pages)
Upvotes: 2