Bif

Extracting page sizes from PDF in Python

I want to read a PDF and get a list of its pages and the size of each page. I don't need to manipulate it in any way, just read it.

I'm currently trying out pyPdf, and it does everything I need except provide a way to get page sizes. I understand that I will probably have to iterate through the pages, since page sizes can vary within a PDF document. Is there another library or method I can use?

I tried using PIL, some online recipes even have d=Image(imagefilename) usage, but it NEVER reads any of my PDFs - it reads everything else I throw at it - even some things I didn't know PIL could do.

Any guidance appreciated. I'm on Windows 7 64-bit with Python 2.5 (because I also do GAE stuff), but I'm happy to do it on Linux or with a more modern Python.

Upvotes: 43

Views: 77118

Answers (9)

cges30901

Reputation: 540

Update on 2023-08-30: provided an example PDF and added cropbox.

Update on 2021-07-22: the original answer was not always correct, so I have updated it.

With PyMuPDF:

>>> import fitz
>>> doc = fitz.open("example.pdf")
>>> page = doc[0]
>>> print(page.rect.width, page.rect.height)
284.0 473.0
>>> print(page.mediabox.width, page.mediabox.height)
595.304 841.89
>>> print(page.cropbox.width, page.cropbox.height)
473.0 284.0

Return values of mediabox, cropbox and rect are of type Rect, which has attributes "width" and "height". For most people, rect is probably the most useful.

These three are identical most of the time, but occasionally they can be very different: cropbox and rect are the visible area of the page (what you see from a regular pdf viewer), while mediabox is the physical medium.

One difference between cropbox and rect is that cropbox is the same as /CropBox in the document and does not change if the page is rotated, whereas rect is affected by rotation. For more information about the different boxes in PyMuPDF, see its glossary. Also see the PDF documentation, section 14.11.2.1.

Sample pdf can be downloaded here.

You can install this module with pip install pymupdf.

Upvotes: 26

mara004

Reputation: 2337

Using pypdfium2:

import pypdfium2 as pdfium

PAGEINDEX = 0  # the first page
FILEPATH = "/path/to/file.pdf"
pdf = pdfium.PdfDocument(FILEPATH)

# option 1
width, height = pdf.get_page_size(PAGEINDEX)

# option 2
page = pdf[PAGEINDEX]
width, height = page.get_size()

# len(pdf) provides the number of pages, so you can iterate through the document
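As a rough sketch of that iteration, the helper below collects the size of every page. It relies only on `len(pdf)` and `pdf.get_page_size(index)` as shown above; the function name `all_page_sizes` is made up for illustration and is not part of pypdfium2.

```python
from typing import List, Tuple


def all_page_sizes(pdf) -> List[Tuple[float, float]]:
    """Return (width, height) in points for every page of an opened document.

    `pdf` is assumed to behave like the PdfDocument used above:
    it must support len() and get_page_size(index).
    """
    return [tuple(pdf.get_page_size(i)) for i in range(len(pdf))]
```

With pypdfium2 installed, `all_page_sizes(pdfium.PdfDocument(FILEPATH))` should then give one `(width, height)` pair per page.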

disclaimer: I'm the maintainer

Upvotes: 1

Joris Schellekens

Reputation: 9012

disclaimer: I am the author of borb, the library used in this answer.

#!chapter_005/src/snippet_002.py
import typing
from borb.pdf import Document
from borb.pdf import PDF


def main():

    # read the Document
    doc: typing.Optional[Document] = None
    with open("output.pdf", "rb") as in_file_handle:
        doc = PDF.loads(in_file_handle)

    # check whether we have read a Document
    assert doc is not None

    # get the width/height
    w = doc.get_page(0).get_page_info().get_width()
    h = doc.get_page(0).get_page_info().get_height()

    # do something with these dimensions
    # TODO

if __name__ == "__main__":
    main()

We start the code by loading the PDF using PDF.loads. Then we get a Page (you could change this code to print the dimensions of each Page, rather than just Page 0). From that Page, we get PageInfo, which contains the width and height.

You can install borb by using pip:

pip install borb

You can also download it from source here.

If you need further examples, check out the examples repository.

Upvotes: 2

Josh Lee

Reputation: 177510

This can be done with pypdf:

>>> from pypdf import PdfReader
>>> reader = PdfReader('example.pdf')
>>> box = reader.pages[0].mediabox
>>> box
RectangleObject([0, 0, 612, 792])
>>> box.width
Decimal('612')
>>> box.height
Decimal('792')

(Formerly known as pyPdf / PyPDF2)
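Since the question asks for the size of every page, here is a minimal sketch that extends the snippet above to the whole document. It uses only `reader.pages` and `mediabox.width` / `mediabox.height` from the example; the helper names `page_sizes` and `read_page_sizes` are hypothetical, and the values are converted to `float` because the box may report them as `Decimal`.

```python
from typing import Iterable, List, Tuple


def page_sizes(pages: Iterable) -> List[Tuple[float, float]]:
    """Return (width, height) in points, taken from each page's MediaBox."""
    return [(float(p.mediabox.width), float(p.mediabox.height)) for p in pages]


def read_page_sizes(path: str) -> List[Tuple[float, float]]:
    """Open a PDF with pypdf and return the size of every page."""
    # Imported here so that page_sizes() itself has no third-party dependency.
    from pypdf import PdfReader

    return page_sizes(PdfReader(path).pages)
```

Calling `read_page_sizes('example.pdf')` would return a list with one entry per page, so documents with mixed page sizes are handled naturally.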

Upvotes: 62

Leo S

Reputation: 11

Working code for Python 3.9 with the PyPDF2 library (this uses the older PyPDF2 API; PdfFileReader was removed in PyPDF2 3.0):

from PyPDF2 import PdfFileReader

reader = PdfFileReader('C:\\MyFolder\\111.pdf')
box = reader.pages[0].mediaBox
print(box.getWidth())
print(box.getHeight())

For all pages:

from PyPDF2 import PdfFileReader

reader = PdfFileReader('C:\\MyFolder\\111.pdf')

# enumerate() supplies the zero-based index, so no manual counter is needed
for i, p in enumerate(reader.pages):
    box = p.mediaBox
    print(f"i:{i}   page:{i+1}   Width = {box.getWidth()}   Height = {box.getHeight()}")

input("Press Enter to continue...")

Upvotes: -2

Jamy Mahabier

Reputation: 428

With pdfrw:

>>> from pdfrw import PdfReader
>>> pdf = PdfReader('example.pdf')
>>> pdf.pages[0].MediaBox
['0', '0', '595.2756', '841.8898']

Lengths are given in points (1 pt = 1/72 inch). The format is [x0, y0, x1, y1] (thanks, mara004!).
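Because pdfrw returns the box as a list of strings, a small conversion step is needed to get usable numbers. The sketch below is one way to do it, assuming the `[x0, y0, x1, y1]` layout described above; the function name `box_size` and its `unit` parameter are made up for illustration.

```python
# Number of points in one unit (1 pt = 1/72 in, 1 in = 25.4 mm)
POINTS_PER_UNIT = {"pt": 1.0, "in": 72.0, "mm": 72.0 / 25.4}


def box_size(box, unit="pt"):
    """Convert a PDF box [x0, y0, x1, y1] to (width, height) in the given unit.

    Entries may be strings (as returned by pdfrw) or numbers.
    """
    x0, y0, x1, y1 = (float(v) for v in box)
    per_unit = POINTS_PER_UNIT[unit]
    return abs(x1 - x0) / per_unit, abs(y1 - y0) / per_unit
```

For the output shown above, `box_size(pdf.pages[0].MediaBox, "mm")` comes out at roughly 210 by 297, i.e. an A4 page.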

Upvotes: 17

mara004

Reputation: 2337

With pikepdf:

import pikepdf

# open the file and select the first page
pdf = pikepdf.Pdf.open("/path/to/file.pdf")
page = pdf.pages[0]

if '/CropBox' in page:
    # use CropBox if defined since that's what the PDF viewer would usually display
    relevant_box = page.CropBox
elif '/MediaBox' in page:
    relevant_box = page.MediaBox
else:
    # fall back to ANSI A (US Letter) if neither CropBox nor MediaBox are defined
    # unlikely, but possible
    relevant_box = [0, 0, 612, 792]

# actually there could also be a viewer preference ViewArea or ViewClip in
# pdf.Root.ViewerPreferences defining which box to use, but most PDF readers 
# disregard this option anyway

# check whether the page defines a UserUnit
userunit = 1
if '/UserUnit' in page:
    userunit = float(page.UserUnit)

# convert the box coordinates to float and multiply with the UserUnit
relevant_box = [float(x)*userunit for x in relevant_box]

# obtain the dimensions of the box
width  = abs(relevant_box[2] - relevant_box[0])
height = abs(relevant_box[3] - relevant_box[1])

rotation = 0
if '/Rotate' in page:
    rotation = page.Rotate

# if the page is rotated clockwise or counter-clockwise, swap width and height
# (pdf rotation modifies the coordinate system, so the box always refers to 
# the non-rotated page)
if (rotation // 90) % 2 != 0:
    width, height = height, width

# now you have width and height in points
# 1 point is equivalent to 1/72in (1in -> 2.54cm)

Upvotes: 7

Myonaiz

Reputation: 365

For pdfminer on Python 3.x (pdfminer.six); I did not try it on Python 2.7:

from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser

pdfPath = "example.pdf"  # placeholder: path to your PDF
pageSizesList = []
with open(pdfPath, 'rb') as f:
    doc = PDFDocument(PDFParser(f))
    for page in PDFPage.create_pages(doc):
        print(page.mediabox)  # the media box, i.e. the page size as a list of 4 numbers: x0 y0 x1 y1
        pageSizesList.append(page.mediabox)  # ends up holding one box per page

Upvotes: 8

Alexander Marin

Reputation: 1

Another way is to use popplerqt4:

import popplerqt4

doc = popplerqt4.Poppler.Document.load('/path/to/my.pdf')
qsizedoc = doc.page(0).pageSize()
h = qsizedoc.height()  # given in pt, 1 pt = 1/72 in
w = qsizedoc.width()

Upvotes: -1
