Reputation: 23

Counting coloured pages in a PDF

I have adapted this code from another StackOverflow post. It converts each PDF page to an image and checks the hue/saturation values for colour. My only issue is that it is very slow: it takes almost a minute for 25 pages. Does anyone have any ideas on how I can make it more efficient?

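from pdf2image import convert_from_path
import numpy as np


def main():
    # Render every page to a PIL image at 500 dpi.
    images = convert_from_path("example1.pdf", 500, poppler_path=r'C:\Program Files\poppler-0.68.0\bin')
    sw = 0     # pages without colour
    color = 0  # pages with colour

    for image in images:
        # Sum the H, S and V channels over all pixels of the page.
        img = np.array(image.convert('HSV'))
        hsv_sum = img.sum(0).sum(0)
        # Zero total hue and zero total saturation means no coloured pixel.
        if hsv_sum[0] == 0 and hsv_sum[1] == 0:
            sw += 1
        else:
            color += 1
    print(color)
    print(sw)


if __name__ == "__main__":
    main()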

Upvotes: 1

Answers (3)

Mark Setchell

Reputation: 207465

Using dpi=500 is going to make unnecessarily large demands on your memory if you are just trying to coarsely detect (probably large) regions of colour.

I would try dpi=72 or even dpi=36 and see if it is still accurate enough.
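To get a feel for how much the resolution matters, here is a rough sketch of the arithmetic. It assumes A4 pages (8.27 x 11.69 inches), which the question does not state, and `pixels_per_page` and `render_pages` are hypothetical helper names of mine. `dpi` and `poppler_path` are keyword arguments of pdf2image's `convert_from_path`.

```python
def pixels_per_page(dpi, width_in=8.27, height_in=11.69):
    """Approximate pixel count of one rendered page (defaults assume A4)."""
    return round(width_in * dpi) * round(height_in * dpi)


def render_pages(pdf_path, dpi=72, poppler_path=None):
    """Render all pages at a reduced resolution (hypothetical wrapper)."""
    # Imported lazily so the arithmetic helper works without pdf2image.
    from pdf2image import convert_from_path
    return convert_from_path(pdf_path, dpi=dpi, poppler_path=poppler_path)


if __name__ == "__main__":
    # Compare the per-page workload at the resolutions discussed above.
    for dpi in (500, 72, 36):
        print(dpi, pixels_per_page(dpi))
```

Under that assumption, a page at 500 dpi has about 48 times as many pixels as the same page at 72 dpi, so both rendering and the HSV conversion have far less data to handle at the lower setting.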

Beyond that, if you are trying to speed things up, it is important to measure what is slow. There is no point speeding up some aspect of your processing that only takes 1% of the time. So, measure how long it takes to convert all the PDF pages to PIL Images, and then measure the time for analysing each page, so that you know where to direct your efforts.
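One simple way to take those measurements is a small timing wrapper around each stage. This is only a sketch using the standard library; `timed` is a hypothetical helper name, not part of pdf2image or PIL.

```python
import time


def timed(label, func, *args, **kwargs):
    """Call func, print how long it took, and return (result, seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.3f} s")
    return result, elapsed


if __name__ == "__main__":
    # Stand-in workload; replace with the rendering and analysis steps.
    total, seconds = timed("sum", sum, range(1_000_000))
```

With the code from the question you could wrap the rendering as `timed("render", convert_from_path, "example1.pdf", 500)` and wrap the per-page analysis the same way, then compare the two numbers.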

If the pages take a long time each to process, consider doing the pages in parallel.
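A minimal sketch of the parallel idea, using `concurrent.futures` from the standard library. `count_matching` is a hypothetical helper, and the predicate you pass in would be your own per-page colour check wrapped in a function.

```python
from concurrent.futures import ThreadPoolExecutor


def count_matching(items, predicate, max_workers=4):
    """Apply predicate to every item concurrently; return (hits, misses)."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order and re-raises any worker exception.
        results = list(pool.map(predicate, items))
    hits = sum(1 for matched in results if matched)
    return hits, len(results) - hits


if __name__ == "__main__":
    # Stand-in data: count the even numbers among 0..9.
    print(count_matching(range(10), lambda n: n % 2 == 0))
```

Threads only help if the per-page work releases the GIL. If your measurements show no gain, `ProcessPoolExecutor` offers the same `map` interface, at the cost of sending each page image to a worker process.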

Upvotes: 1

Joris Schellekens

Reputation: 9012

Disclaimer: I am the author of borb, the library used in this answer.

Depending on what exactly is colored in the page, you could use borb to get this done.

borb has the concept of EventListener, which gets notified of rendering instructions (as they are coming out of the parser).

This should be as fast as simply reading the PDF.

Edit: based on your comment, I am including links to the following examples.

These examples might seem lengthy, but they are complete, in the sense that they first create the PDF and then extract content/information from it.

Upvotes: 0

Muhammad Shiddiq

Reputation: 11

Try this:

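from PyPDF2 import PdfReader  # PdfFileReader was removed in PyPDF2 3.0

colored_page_count = 0

# The context manager closes the file even if parsing fails.
with open('nama_file.pdf', 'rb') as pdf_file:
    pdf_reader = PdfReader(pdf_file)

    for page in pdf_reader.pages:
        # Caveat: colour spaces are normally listed under the page's
        # /Resources dictionary, so this page-level lookup may never match.
        if page.get("/ColorSpace") == "/DeviceRGB":
            colored_page_count += 1

print(colored_page_count)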

Upvotes: 0
