Eiyrioü von Kauyf

Reputation: 4735

Compress PDFs using Python

So I have a gazillion PDFs in a folder, and I want to shrink them recursively (using os.path.walk). I see that Adobe Acrobat Pro has a "Save As Reduced Size" option. Would I be able to use this? If not, how do you suggest I do it?

Note: Yes, I would like them to stay as PDFs, because I find PDF viewers to be the most commonly used and installed file viewers.

Upvotes: 25

Views: 72335

Answers (5)

K J

Reputation: 11849

The OP's question mentioned that "Adobe Pro has a save as reduced size". Acrobat Reader is, in part, a significantly cut-down version of Pro that can still edit PDFs to a limited extent.

We can take advantage of that in a very simple manner, but it is not my suggested solution because:

  • Reader is GUI based, so the edits are triggered manually, file by file, on "Save As" or on closing. That makes it unsuited to batch commands.
  • To trigger the reduced file size the file must be intentionally modified, so the result is not as close to the original file as implied.
  • A by-product of the reduced size is that the file will be "web" optimised, so reading starts at the head, which is the opposite of the way the standard says PDFs should be processed. Thus it is NOT the best method for reduction (web optimisation actually increases file size).

Let us do a comparison, noting that Adobe Acrobat Reader is good but not the best. I will start with an as yet unmentioned, best-in-class command-line PDF rebuilder: qpdf.

My starting point is source.PDF, a 15-page mixed-content file of 463,937 bytes. (It intentionally has one non-critical wrong byte in its "startxref" offset; per the standard, reading a PDF starts at the "trailer".)

Comparison of PDF re-compression at 100% quality. Any further compaction can only be done by degrading the quality or by hand-tuned optimisation.

463,937 bytes Source (see the note above about linearization)

qpdf corrects any faults it perceives in PDF structures

471,146 bytes qpdf in.pdf --linearize out.pdf
470,039 bytes qpdf in.pdf out.pdf (normal rebuilt/repair PDF)
468,401 bytes qpdf in.pdf --optimize-images out.pdf
468,401 bytes qpdf --stream-data=compress --recompress-flate --optimize-images pdfsizeopt.pdf outq.pdf (Compress PDF)

You may wonder why, with all those options, the file does not get any smaller. That is because most PDF files are already optimised to an ISO standard structure.
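For batch use, the qpdf command lines listed above can be driven from Python. This is only a sketch: it assumes the qpdf executable is on PATH, and the helper names qpdf_command and run_qpdf are made up for illustration. The flags are the ones already shown in the command lines above.

```python
import subprocess

# Flags copied from the "Compress PDF" command line above.
COMPRESS_FLAGS = ["--stream-data=compress", "--recompress-flate", "--optimize-images"]


def qpdf_command(src, dst, flags=COMPRESS_FLAGS):
    """Build the argument list for one qpdf run (hypothetical helper)."""
    return ["qpdf", *flags, str(src), str(dst)]


def run_qpdf(src, dst, flags=COMPRESS_FLAGS):
    """Run qpdf and return its exit code.

    qpdf exits with 0 on success and 3 when it succeeded with warnings,
    which is what a repaired file usually produces.
    """
    return subprocess.run(qpdf_command(src, dst, flags)).returncode
```

For example, run_qpdf("in.pdf", "out.pdf", ["--linearize"]) would correspond to the first qpdf line in the list.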

Surely there is some way to maintain quality and optimise more? Let's try some other PDF repair tools. Still not much of a reduction.

464,138 bytes Typical minimally "natively" recompressed and "repaired/cleaned" without loss!

What about those mentioned by others? (See the web optimisation note above.)

350,824 bytes GhostScript -dFastWebView -sDEVICE=pdfwrite -o"%cd%\output.pdf" -f input.pdf  

335,558 bytes Fixed AND Re-compressed as WebEnhanced by Adobe Reader DC

What about without web enhanced?

313,458 bytes cpdfSqueezed = 67.56% of original.
312,618 bytes GhostScript -sDEVICE=pdfwrite -o"%cd%\output.pdf" -f input.pdf
310,451 bytes internal PNGs optimised by PDFSizeOpt as suggested by others
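The figures above are easier to compare as a percentage of the 463,937-byte source. A small sketch of that arithmetic follows; the byte counts are the ones quoted in this answer, while the dictionary labels and the helper name percent_of_source are shorthand chosen for this example.

```python
SOURCE_BYTES = 463_937

# Output sizes quoted in this answer, keyed by a short label.
results = {
    "qpdf --linearize": 471_146,
    "qpdf (plain rebuild)": 470_039,
    "qpdf --optimize-images": 468_401,
    "Ghostscript -dFastWebView": 350_824,
    "Adobe Reader DC": 335_558,
    "cpdf squeeze": 313_458,
    "Ghostscript pdfwrite": 312_618,
    "pdfsizeopt": 310_451,
}


def percent_of_source(size, source=SOURCE_BYTES):
    """Return size as a percentage of the source, rounded to two decimals."""
    return round(100 * size / source, 2)


# Print the results from smallest to largest output.
for label, size in sorted(results.items(), key=lambda item: item[1]):
    print(f"{size:>9,} bytes  {percent_of_source(size):6.2f}%  {label}")
```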

So far PDFSizeOpt is the best contender, as it deliberately extracts bitmaps (perhaps PNG or WebP sourced) and optimises those images using the same compression as JPEG.

A common misconception is that JPEG-based images might be compressed further. That is not the case (unless they benefit from bit reduction, which is rare). They already use DCT, the best internal PDF compression method for them, so there is no need to extract and modify any JPEGs.

Upvotes: 1

Yang

Reputation: 309

Consider using Ghostscript, an open-source tool for processing PostScript and PDF files.

# To install Ghostscript, use: sudo apt install ghostscript
gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/screen -dNOPAUSE -dQUIET -dBATCH -sOutputFile=compressed.pdf input.pdf

This significantly reduces the image quality of the PDF while preserving all other information, compressing a 25MB PDF paper to just 1.7MB.

Wrapped as a Python function:

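import subprocess


def compress_pdf_file(input_path, output_path):
    """Compress input_path with Ghostscript and write the result to output_path."""
    # To install Ghostscript, use: apt install ghostscript
    subprocess.run(
        [
            "gs",
            "-sDEVICE=pdfwrite",
            "-dCompatibilityLevel=1.4",
            "-dPDFSETTINGS=/screen",  # lowest-resolution preset, smallest output
            "-dNOPAUSE",
            "-dQUIET",
            "-dBATCH",
            "-sOutputFile=" + output_path,
            input_path,
        ],
        check=True,  # raise CalledProcessError if Ghostscript fails
    )
    return output_path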
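To apply this to a whole folder tree, as the question asks, one option is os.walk (the os.path.walk mentioned in the question only exists in Python 2). The sketch below is an illustration under assumptions rather than a tested recipe: compress_tree and its ".tmp.pdf" scratch naming are made up for this example, and it accepts any compress(input_path, output_path) callable, such as compress_pdf_file above. It only replaces a file when the result is actually smaller.

```python
import os


def compress_tree(root, compress):
    """Walk root recursively and compress every PDF in place.

    compress is any callable taking (input_path, output_path).
    Returns the total number of bytes saved.
    """
    saved = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Skip non-PDFs and leftover scratch files from earlier runs.
            if not name.lower().endswith(".pdf") or name.endswith(".tmp.pdf"):
                continue
            src = os.path.join(dirpath, name)
            tmp = src[:-4] + ".tmp.pdf"  # hypothetical scratch name
            compress(src, tmp)
            before, after = os.path.getsize(src), os.path.getsize(tmp)
            if after < before:
                os.replace(tmp, src)  # keep the smaller file
                saved += before - after
            else:
                os.remove(tmp)  # compression did not help; keep the original
    return saved
```

A call such as compress_tree("/path/to/folder", compress_pdf_file) would then process every PDF under that folder. Because this overwrites files, trying it on a copy of the folder first is the safer choice.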

Upvotes: 0

Jean-Francois T.

Reputation: 12940

pdfsizeopt was shrinking the last page of my PDF.

However, the solution provided in a now-deleted answer was useful: the tool pdfc, written in Python, hosted on GitHub and updated from time to time, worked fine for me.

You can download the Python file pdf_compressor.py from the repo: https://github.com/theeko74/pdfc/blob/master/pdf_compressor.py

Provided you have Ghostscript installed, you can then run the following:

python pdf_compressor.py <PDF-input-file> --backup

More details on the available options are in the README of the repo: https://github.com/theeko74/pdfc

Upvotes: 1

the

Reputation: 21911

From the project's GitHub page for pdfsizeopt, which is written in Python:

"pdfsizeopt is a program for converting large PDF files to small ones. More specifically, pdfsizeopt is a free, cross-platform command-line application (for Linux, Mac OS X, Windows and Unix) and a collection of best practices to optimize the size of PDF files, with focus on PDFs created from TeX and LaTeX documents. pdfsizeopt is written in Python..."

You can probably easily adapt this to your specific needs.

Upvotes: 12

Kenneth Eggering

Reputation: 252

I realize this is an old question, but I thought I would suggest an alternative to pdfsizeopt, as I have experienced quality loss when using it on PDFs of maps. PDFTron offers a comprehensive set of functionality. Here is a snippet modified from their web page (see "example 1"):

import site
site.addsitedir(r"...pathToPDFTron\PDFNetWrappersWin32\PDFNetC\Lib")

from PDFNetPython import PDFDoc, Optimizer, SDFDoc

doc = PDFDoc(inPDF_Path)
doc.InitSecurityHandler()
Optimizer.Optimize(doc)
doc.Save(outPDF_Path, SDFDoc.e_linearized)
doc.Close()

Upvotes: 9
