Mihail-Cosmin Munteanu

Reputation: 522

PyPDF4 - Exported PDF file size too big

I have a PDF file of around 7000 pages and 479 MB. I have created a Python script using PyPDF4 that extracts only the pages containing specific words. The script works, but the new PDF file, even though it has only 650 of the original 7000 pages, is now larger than the original file (498 MB, to be exact).

Is there any way to reduce the file size of the new PDF?

The script I used:

from PyPDF4 import PdfFileWriter, PdfFileReader
import os
import re


output = PdfFileWriter()

input = PdfFileReader(open('Binder.pdf', 'rb')) # open input

for i in range(0, input.getNumPages()):
    content = ""
    content += input.getPage(i).extractText() + "\n"


    #Format 1
    RS = re.search('FIGURE', content)
    RS1 = #... Only one search given as example. I have more, but are irrelevant for the question.
    #....

    # Format 2
    RS20 = re.search(r'FIG\.', content)  # escape the dot so it matches a literal "." and not any character
    RS21 = #... Only one search given as example. I have more, but are irrelevant for the question.
    #....

    if (all(v is not None for v in [RS, RS1, RS2, RS3, RS4, RS5, RS6, RS7, RS8, RS9]) or all(v is not None for v in [RS20, RS21, RS22, RS23, RS24, RS25, RS26, RS27, RS28, RS29, RS30, RS30])):
        p = input.getPage(i)
        output.addPage(p)

#Save pages to new PDF file
with open('ExtractedPages.pdf', 'wb') as f:
    output.write(f)

Upvotes: 8

Views: 14581

Answers (3)

Marshall Kiplinger

Reputation: 21

If you're okay with losing any links in the PDF, try calling the PdfFileWriter.removeLinks() function before you save the file. I was having the same issue, but calling this function before I saved brought my file size down from 44.7MB to just 1.09MB.
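A minimal sketch of where that call fits, assuming PyPDF4 is installed. The function name `copy_pages_without_links` and its parameters are made up for illustration; `removeLinks()` is the only part taken from this answer, and it strips links and annotations from the output.

```python
def copy_pages_without_links(src_path, dst_path, page_indexes):
    """Copy the selected pages to a new PDF, stripping links before saving."""
    # Imported here so the sketch can be loaded even where PyPDF4 is missing.
    from PyPDF4 import PdfFileReader, PdfFileWriter

    writer = PdfFileWriter()
    with open(src_path, "rb") as src:
        reader = PdfFileReader(src)
        for index in page_indexes:
            writer.addPage(reader.getPage(index))
        # Drop links and annotations; this must happen before write().
        writer.removeLinks()
        # Write while the source file is still open, since pages are read lazily.
        with open(dst_path, "wb") as dst:
            writer.write(dst)
```

For the files in the question, this could be called as `copy_pages_without_links("Binder.pdf", "ExtractedPages.pdf", matching_indexes)`, where `matching_indexes` is a hypothetical list of the page numbers that passed the text search.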

Upvotes: 2

php

Reputation: 49

On Linux, you can compress the resulting PDF file with the ps2pdf tool, which is part of the Ghostscript suite. Install Ghostscript:

$ sudo apt-get install ghostscript

Then run the following command to reduce the size of a large PDF file:

$ ps2pdf large.pdf compressed.pdf

When I tried this, I did not find any loss in quality.
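If you want to run the same command from the Python script and check how much it saved, something like the following might work. This is only a sketch: the helper names `percent_of_original` and `compress_with_ps2pdf` are invented here, and it assumes `ps2pdf` is on the PATH.

```python
import os
import shutil
import subprocess


def percent_of_original(original_size, compressed_size):
    """Return the compressed size as a percentage of the original size."""
    return 100.0 * compressed_size / original_size


def compress_with_ps2pdf(src_path, dst_path):
    """Run `ps2pdf src dst` and return (original_bytes, compressed_bytes)."""
    if shutil.which("ps2pdf") is None:
        raise FileNotFoundError("ps2pdf not found; install Ghostscript first")
    # Same invocation as the shell command above.
    subprocess.run(["ps2pdf", src_path, dst_path], check=True)
    return os.path.getsize(src_path), os.path.getsize(dst_path)
```

Calling `percent_of_original(*compress_with_ps2pdf("large.pdf", "compressed.pdf"))` then gives the new size as a percentage of the old one.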

Upvotes: 4

Mihail-Cosmin Munteanu

Reputation: 522

After a lot of searching, I found some solutions. The only problem with the exported PDF file was that it was uncompressed, so I needed a way to compress a PDF file:

  1. PyPDF2 and PyPDF4 do not have an option to compress PDFs. PyPDF2 has the compressContentStreams() method, but it did not work for me.

  2. I found a few other tools that claim to compress PDFs, but none worked for me (listing them here in case they work for others): pylovepdf, pdfsizeopt, pdfc.

  3. The first solution that worked for me was Adobe Acrobat professional. It reduced the size from 498 MB to 2.99 MB.

  4. [Best Solution] As an open-source alternative that works, I found coherentpdf. On Windows you can download the pre-built PDF squeezer tool. Then in cmd:

    cpdfsqueeze.exe input.pdf output.pdf

This compressed the PDF even more than Adobe Acrobat did: from 498 MB to 2.48 MB, about 0.5% of the original size. I think this is the best solution because it can be called from your Python code.

  5. Edit: I found another free solution that also has a GUI: PDFsam. You can use the Merge feature on a single PDF file, and in the advanced settings make sure Compress Output is checked. This compressed the file from 498 MB to 3.2 MB.
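For option 4, calling cpdfsqueeze from the same Python script could look roughly like this. It is a sketch, not something from the coherentpdf documentation: the helper names are made up, and it assumes `cpdfsqueeze.exe` is on the PATH (otherwise pass its full path as `exe`).

```python
import subprocess


def build_squeeze_command(input_pdf, output_pdf, exe="cpdfsqueeze.exe"):
    """Build the same command line as the cmd example: exe, input, output."""
    return [exe, input_pdf, output_pdf]


def squeeze_pdf(input_pdf, output_pdf, exe="cpdfsqueeze.exe"):
    """Run cpdfsqueeze and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_squeeze_command(input_pdf, output_pdf, exe), check=True)
```

After `output.write(f)` in the original script, `squeeze_pdf("ExtractedPages.pdf", "ExtractedPages_small.pdf")` would produce the compressed copy; the second file name is just an example.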

Upvotes: 11
