Vedran Šego

Reputation: 3765

Extract page sizes from large PDFs

I need to extract the number of pages and their sizes in px/mm/cm/some-unit from PDF files using Python (sadly, 2.7, because it's a legacy project). The problem is that the files can be truly huge (hundreds of MiBs) because they'll contain large images.

I do not care about that content; I just want a list of page sizes from the file, with as little RAM consumption as possible.

I found quite a few libraries that can do that (including, but not limited to, the ones in the answers here), but none of them make any remarks about memory usage, and I suspect that most of them, if not all, read the whole file into memory before doing anything with it, which doesn't fit my purpose.

Are there any libraries that extract only structure and give me the data that I need without clogging my RAM?

Upvotes: 0

Views: 1421

Answers (2)

Vedran Šego

Reputation: 3765

Inspired by the other answer, I found that libvips, which is suggested there, uses poppler (it can fall back to some other library if it cannot find poppler).

So, instead of using the super-powerful pyvips, which seems great for multiple types of documents, I went with just poppler, which has multiple Python bindings. I picked pdflib and came up with this solution:

from sys import argv

from pdflib import Document


doc = Document(argv[1])
for num, page in enumerate(doc, start=1):
    print(num, tuple(2.54 * x / 72 for x in page.size))

The 2.54 * x / 72 part converts from points (PDF's native unit, 1/72 of an inch) to centimetres, nothing more.
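Since the question asks for sizes in px/mm/cm/some-unit, the same conversion generalises to a few small helpers. This is a sketch of my own (the function names are not part of pdflib); it only relies on the fixed definitions 1 pt = 1/72 inch and 1 inch = 2.54 cm:

```python
# PDF page sizes are expressed in points: 1 pt = 1/72 inch.
PT_PER_INCH = 72.0
CM_PER_INCH = 2.54

def pt_to_cm(pt):
    """Convert points to centimetres."""
    return pt * CM_PER_INCH / PT_PER_INCH

def pt_to_mm(pt):
    """Convert points to millimetres."""
    return pt * CM_PER_INCH * 10 / PT_PER_INCH

def pt_to_px(pt, dpi=300):
    """Convert points to pixels at a given rendering DPI."""
    return pt * dpi / PT_PER_INCH

# An A4 page is 595.2756 x 841.8898 pt, i.e. 21 x 29.7 cm.
print(round(pt_to_cm(595.2756), 2))  # 21.0
print(round(pt_to_mm(841.8898), 1))  # 297.0
```

Pixels are the odd one out: a PDF page has no intrinsic pixel size, only a size at whatever DPI you choose to render it.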

Speed and memory test on a 264MiB file with one huge image per page:

$ /usr/bin/time -f %M\ %e python t2.py big.pdf 
1 (27.99926666666667, 20.997333333333337)
2 (27.99926666666667, 20.997333333333337)
...
56 (27.99926666666667, 20.997333333333337)
21856 0.09

Just for reference, if anyone is looking for a pure-Python solution, I made a crude one, which is available here. It is not thoroughly tested and is much, much slower than this one (around 30 s for the file above).
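For a rough idea of what a pure-Python approach can look like: each PDF page object carries a /MediaBox entry, [x0 y0 x1 y1] in points, so scanning the raw bytes for those entries yields the page sizes without parsing any page content (combine with mmap to avoid holding the whole file in RAM). This is a simplified sketch of the idea, not the linked code; it misses pages that inherit /MediaBox from a parent node or store it inside a compressed object stream:

```python
import re

# /MediaBox [x0 y0 x1 y1] -- coordinates in points (1/72 inch).
MEDIABOX = re.compile(
    rb"/MediaBox\s*\[\s*([\d.+-]+)\s+([\d.+-]+)\s+([\d.+-]+)\s+([\d.+-]+)\s*\]"
)

def media_boxes(data):
    """Return (width, height) in points for every /MediaBox found in data."""
    sizes = []
    for m in MEDIABOX.finditer(data):
        x0, y0, x1, y1 = (float(v) for v in m.groups())
        sizes.append((x1 - x0, y1 - y0))
    return sizes

# Example on a raw fragment of a PDF page object (US Letter, 612 x 792 pt):
print(media_boxes(b"<< /Type /Page /MediaBox [0 0 612 792] >>"))  # [(612.0, 792.0)]
```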

Upvotes: 0

jcupitt

Reputation: 11179

pyvips can do this. It loads the file structure when you open the PDF and only renders each page when you ask for pixels.

For example:

#!/usr/bin/python

import sys
import pyvips

# Keep opening successive pages until pyvips raises an error
# (i.e. we have gone past the last page).
i = 0
while True:
    try:
        x = pyvips.Image.new_from_file(sys.argv[1], dpi=300, page=i)
        print("page =", i)
        print("width =", x.width)
        print("height =", x.height)
    except pyvips.Error:
        break

    i += 1

libvips 8.7, due in another week or so, adds a new metadata item called n-pages that you can use to get the length of the document. Until that is released, though, you need to keep incrementing the page number until you get an error.

Using this PDF, when I run the program I see:

$ /usr/bin/time -f %M:%e ./sizes.py ~/pics/r8.pdf 
page = 0
width = 2480
height = 2480
page = 1
width = 2480
height = 2480
page = 2
width = 4960
height = 4960
...
page = 49
width = 2480
height = 2480
55400:0.19

So it opened 50 pages in 0.2 s real time, with a total peak memory use of 55 MB. That's with py3, but it works fine with py2 as well. The dimensions are in pixels at 300 DPI.

If you set page to -1, it'll load all the pages in the document as a single very tall image. All the pages need to be the same size for this though, sadly.

Upvotes: 1
