I want to read a PDF and get a list of its pages and the size of each page. I don't need to manipulate it in any way, just read it.
I'm currently trying out pyPdf, and it does everything I need except provide a way to get page sizes. I understand that I will probably have to iterate through the pages, as page sizes can vary within a PDF document. Is there another library or method I can use?
I tried using PIL; some online recipes even show d=Image(imagefilename) usage, but it NEVER reads any of my PDFs, although it reads everything else I throw at it, even some things I didn't know PIL could do.
Any guidance appreciated. I'm on Windows 7 64-bit with Python 2.5 (because I also do GAE stuff), but I'm happy to do it on Linux or with a more modern Python.
Upvotes: 43
Views: 77118
Reputation: 540
Update 2023-08-30: provided an example PDF and added cropbox.
Update 2021-07-22: the original answer was not always correct, so I updated it.
With PyMuPDF:
>>> import fitz
>>> doc = fitz.open("example.pdf")
>>> page = doc[0]
>>> print(page.rect.width, page.rect.height)
284.0 473.0
>>> print(page.mediabox.width, page.mediabox.height)
595.304 841.89
>>> print(page.cropbox.width, page.cropbox.height)
473.0 284.0
Return values of mediabox, cropbox and rect are of type Rect, which has attributes "width" and "height". For most people, rect is probably the most useful.
These three are identical most of the time, but occasionally they can be very different: cropbox and rect are the visible area of the page (what you see from a regular pdf viewer), while mediabox is the physical medium.
One difference between cropbox and rect is that cropbox is the same as /CropBox in the document and does not change if the page is rotated, whereas rect is affected by rotation. For more information about the different boxes in PyMuPDF, see its glossary. Also see section 14.11.2.1 of the PDF documentation.
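To make the relationship between cropbox and rect concrete, here is a minimal pure-Python sketch. It is not PyMuPDF's implementation; the helper name visible_size is hypothetical, and the 90 degree rotation is an assumption chosen so the numbers line up with the output shown above.

```python
def visible_size(box, rotation=0):
    """Return (width, height) of a page box as displayed after rotation.

    box is (x0, y0, x1, y1) in points; rotation is the page rotation
    in degrees and is assumed to be a multiple of 90.
    """
    x0, y0, x1, y1 = (float(v) for v in box)
    width, height = abs(x1 - x0), abs(y1 - y0)
    # A quarter turn (90 or 270 degrees) swaps the displayed width and height.
    if (int(rotation) // 90) % 2 == 1:
        width, height = height, width
    return width, height


# Assumed values: a 473 x 284 pt crop box on a page rotated by 90 degrees.
print(visible_size((0, 0, 473, 284), rotation=90))
```

With rotation set to 0 or 180 the box dimensions come back unchanged, which is why cropbox and rect agree on unrotated pages.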
Sample pdf can be downloaded here.
You can install this module with pip install pymupdf.
Upvotes: 26
Reputation: 2337
Using pypdfium2:
import pypdfium2 as pdfium
PAGEINDEX = 0 # the first page
FILEPATH = "/path/to/file.pdf"
pdf = pdfium.PdfDocument(FILEPATH)
# option 1
width, height = pdf.get_page_size(PAGEINDEX)
# option 2
page = pdf[PAGEINDEX]
width, height = page.get_size()
# len(pdf) provides the number of pages, so you can iterate through the document
disclaimer: I'm the maintainer
Upvotes: 1
Reputation: 9012
disclaimer: I am the author of borb, the library used in this answer.
#!chapter_005/src/snippet_002.py
import typing
from borb.pdf import Document
from borb.pdf import PDF

def main():
    # read the Document
    doc: typing.Optional[Document] = None
    with open("output.pdf", "rb") as in_file_handle:
        doc = PDF.loads(in_file_handle)
    # check whether we have read a Document
    assert doc is not None
    # get the width/height
    w = doc.get_page(0).get_page_info().get_width()
    h = doc.get_page(0).get_page_info().get_height()
    # do something with these dimensions
    # TODO

if __name__ == "__main__":
    main()
We start the code by loading the PDF using PDF.loads.
Then we get a Page (you could change this code to print the dimensions of each Page, rather than just Page 0).
From that Page, we get PageInfo, which contains the width and height.
You can install borb using pip:
pip install borb
You can also download it from source here.
If you need further examples, check out the examples repository.
Upvotes: 2
Reputation: 177510
This can be done with pypdf:
>>> from pypdf import PdfReader
>>> reader = PdfReader('example.pdf')
>>> box = reader.pages[0].mediabox
>>> box
RectangleObject([0, 0, 612, 792])
>>> box.width
Decimal('612')
>>> box.height
Decimal('792')
(Formerly known as pyPdf / PyPDF2)
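Since the question asks for the size of every page, the same idea can be extended to loop over all pages. The following is a sketch rather than an official pypdf recipe: page_sizes and read_page_sizes are hypothetical helper names, and the stand-in pages exist only so the example runs without a PDF at hand. float() is used because the width and height may come back as Decimal (as in the output above) or as float, depending on the library version.

```python
from types import SimpleNamespace


def page_sizes(pages):
    """Return a list of (width, height) tuples in points, one per page.

    Each page only needs a .mediabox with .width and .height,
    which is what the pypdf pages above provide.
    """
    return [(float(p.mediabox.width), float(p.mediabox.height)) for p in pages]


def read_page_sizes(path):
    """Open a PDF with pypdf and return the size of every page."""
    # Imported lazily so page_sizes() stays usable without pypdf installed.
    from pypdf import PdfReader
    return page_sizes(PdfReader(path).pages)


# Stand-in pages (assumed values) to show the shape of the result.
demo_pages = [
    SimpleNamespace(mediabox=SimpleNamespace(width=612, height=792)),
    SimpleNamespace(mediabox=SimpleNamespace(width=595, height=842)),
]
sizes = page_sizes(demo_pages)
for number, (width, height) in enumerate(sizes, start=1):
    print(f"page {number}: {width} x {height} pt")
```

For a real file, read_page_sizes('example.pdf') would be the call to use, assuming pypdf is installed.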
Upvotes: 62
Reputation: 11
Working code for Python 3.9 with PyPDF2 (this uses the older PdfFileReader API, which PyPDF2 3.0 replaced with PdfReader):
from PyPDF2 import PdfFileReader
reader = PdfFileReader('C:\\MyFolder\\111.pdf')
box = reader.pages[0].mediaBox
print(box.getWidth())
print(box.getHeight())
For all pages:
from PyPDF2 import PdfFileReader
reader = PdfFileReader('C:\\MyFolder\\111.pdf')
i = 0
for p in reader.pages:
    box = p.mediaBox
    print(f"i:{i} page:{i+1} Width = {box.getWidth()} Height = {box.getHeight()}")
    i = i + 1
input("Press Enter to continue...")
Upvotes: -2
Reputation: 428
With pdfrw:
>>> from pdfrw import PdfReader
>>> pdf = PdfReader('example.pdf')
>>> pdf.pages[0].MediaBox
['0', '0', '595.2756', '841.8898']
Lengths are given in points (1 pt = 1/72 inch). The format is [x0, y0, x1, y1] (thanks, mara004!).
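Because the box comes back as a list of strings, a small conversion step is needed to get usable numbers. The sketch below uses only the standard library; the helper names box_size and points_to_mm are hypothetical, and the input is the MediaBox shown above.

```python
POINTS_PER_INCH = 72
MM_PER_INCH = 25.4


def box_size(box):
    """Return (width, height) in points for a box given as [x0, y0, x1, y1]."""
    x0, y0, x1, y1 = (float(v) for v in box)
    return abs(x1 - x0), abs(y1 - y0)


def points_to_mm(points):
    """Convert a length in points (1/72 inch) to millimetres."""
    return points / POINTS_PER_INCH * MM_PER_INCH


width_pt, height_pt = box_size(['0', '0', '595.2756', '841.8898'])
width_mm, height_mm = points_to_mm(width_pt), points_to_mm(height_pt)
print(f"{width_mm:.1f} x {height_mm:.1f} mm")  # prints "210.0 x 297.0 mm" (A4)
```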
Upvotes: 17
Reputation: 2337
With pikepdf:
import pikepdf

# open the file and select the first page
pdf = pikepdf.Pdf.open("/path/to/file.pdf")
page = pdf.pages[0]

if '/CropBox' in page:
    # use CropBox if defined since that's what the PDF viewer would usually display
    relevant_box = page.CropBox
elif '/MediaBox' in page:
    relevant_box = page.MediaBox
else:
    # fall back to ANSI A (US Letter) if neither CropBox nor MediaBox are defined
    # unlikely, but possible
    relevant_box = [0, 0, 612, 792]

# actually there could also be a viewer preference ViewArea or ViewClip in
# pdf.Root.ViewerPreferences defining which box to use, but most PDF readers
# disregard this option anyway

# check whether the page defines a UserUnit
userunit = 1
if '/UserUnit' in page:
    userunit = float(page.UserUnit)

# convert the box coordinates to float and multiply with the UserUnit
relevant_box = [float(x)*userunit for x in relevant_box]

# obtain the dimensions of the box
width = abs(relevant_box[2] - relevant_box[0])
height = abs(relevant_box[3] - relevant_box[1])

rotation = 0
if '/Rotate' in page:
    rotation = page.Rotate

# if the page is rotated clockwise or counter-clockwise, swap width and height
# (pdf rotation modifies the coordinate system, so the box always refers to
# the non-rotated page)
if (rotation // 90) % 2 != 0:
    width, height = height, width

# now you have width and height in points
# 1 point is equivalent to 1/72in (1in -> 2.54cm)
Reputation: 365
For pdfminer on Python 3.x (pdfminer.six; not tested on Python 2.7):
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage

pdfPath = "example.pdf"  # placeholder path, replace with your own file
parser = PDFParser(open(pdfPath, 'rb'))
doc = PDFDocument(parser)
pageSizesList = []
for page in PDFPage.create_pages(doc):
    print(page.mediabox)  # the media box, i.e. the page size as a list of 4 numbers: x0 y0 x1 y1
    pageSizesList.append(page.mediabox)  # pageSizesList ends up holding one such list per page
Upvotes: 8
Reputation: 1
Another way is to use popplerqt4:
import popplerqt4

doc = popplerqt4.Poppler.Document.load('/path/to/my.pdf')
qsizedoc = doc.page(0).pageSize()
h = qsizedoc.height()  # given in pt, 1 pt = 1/72 in
w = qsizedoc.width()
Upvotes: -1