Reputation: 23
I have adapted this code from another StackOverflow post. It converts each PDF page to an image and checks the hue and saturation values for colour. My only issue is that it is very slow, taking almost a minute for 25 pages. Does anyone have any ideas on how I can make it more efficient?
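from pdf2image import convert_from_path
import numpy as np


def main():
    # Render every page of the PDF to a PIL image at 500 dpi.
    images = convert_from_path("example1.pdf", 500, poppler_path=r'C:\Program Files\poppler-0.68.0\bin')
    sw = 0     # black-and-white pages
    color = 0  # pages containing colour
    for image in images:
        img = np.array(image.convert('HSV'))
        # Sum each HSV channel over all pixels; zero hue and zero saturation means no colour.
        hsv_sum = img.sum(0).sum(0)
        if hsv_sum[0] == 0 and hsv_sum[1] == 0:
            sw += 1
        else:
            color += 1
    print(color)
    print(sw)


if __name__ == "__main__":
    main()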
Upvotes: 1
Views: 751
Reputation: 207465
Using dpi=500 is going to make unnecessarily large demands on your memory if you are just trying to coarsely detect (probably large) regions of colour. I would try dpi=72 or even dpi=36 and see if it is still accurate enough.
Beyond that, if you are trying to speed things up, it is important to measure what is slow: there is no point speeding up some aspect of your processing that only takes 1% of the time. So, measure how long it takes to convert all the PDF pages to PIL Image objects, and then measure the time for analysing each page, so that you know where to direct your efforts.
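As a rough sketch of that measurement, the two stages can be timed separately with time.perf_counter. The helper names timed and profile_pdf are made up for illustration, and the sketch assumes poppler is on your PATH (otherwise pass poppler_path as in the question).

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def profile_pdf(pdf_path, dpi=72):
    """Return (conversion_seconds, [analysis_seconds for each page])."""
    # Imported here so the timing helper above works without these packages.
    from pdf2image import convert_from_path
    import numpy as np

    def has_colour(image):
        # Same test as the question: any non-zero hue or saturation means colour.
        hsv = np.array(image.convert("HSV"))
        return bool(hsv[..., :2].any())

    images, t_convert = timed(convert_from_path, pdf_path, dpi)
    t_analyse = [timed(has_colour, image)[1] for image in images]
    return t_convert, t_analyse
```

Calling profile_pdf("example1.pdf") and comparing the first number with the sum of the second list should show whether rendering or analysis dominates.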
If the pages take a long time each to process, consider doing the pages in parallel.
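One possible way to do that, as a sketch rather than a tuned solution: split the page range into chunks and let each worker process render and check its own chunk, using the first_page and last_page arguments of convert_from_path. The names page_chunks, count_colour_pages and count_colour_parallel are hypothetical, and poppler is again assumed to be on your PATH.

```python
from concurrent.futures import ProcessPoolExecutor


def page_chunks(n_pages, n_chunks):
    """Split pages 1..n_pages into contiguous (first, last) ranges."""
    if n_pages <= 0:
        return []
    n_chunks = max(1, min(n_chunks, n_pages))
    size, extra = divmod(n_pages, n_chunks)
    chunks, first = [], 1
    for i in range(n_chunks):
        # The first `extra` chunks take one additional page each.
        last = first + size - 1 + (1 if i < extra else 0)
        chunks.append((first, last))
        first = last + 1
    return chunks


def count_colour_pages(pdf_path, first, last, dpi=72):
    """Render pages first..last and count those containing colour."""
    from pdf2image import convert_from_path
    import numpy as np

    pages = convert_from_path(pdf_path, dpi, first_page=first, last_page=last)
    colour = 0
    for page in pages:
        hsv = np.array(page.convert("HSV"))
        # Any non-zero hue or saturation value means the page has colour.
        if hsv[..., :2].any():
            colour += 1
    return colour


def count_colour_parallel(pdf_path, n_workers=4, dpi=72):
    """Count colour pages using one worker process per chunk of pages."""
    from pdf2image import pdfinfo_from_path

    n_pages = pdfinfo_from_path(pdf_path)["Pages"]
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        futures = [
            pool.submit(count_colour_pages, pdf_path, first, last, dpi)
            for first, last in page_chunks(n_pages, n_workers)
        ]
        return sum(f.result() for f in futures)
```

On Windows, call count_colour_parallel("example1.pdf") from under an if __name__ == "__main__": guard, because worker processes re-import the script. convert_from_path also accepts a thread_count argument, which may be a simpler first thing to try if rendering turns out to be the slow part.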
Upvotes: 1
Reputation: 9012
Disclaimer: I am the author of borb, the library used in this answer.
Depending on what exactly is coloured in the page, you could use borb to get this done. borb has the concept of an EventListener, which gets notified of rendering instructions as they come out of the parser. This should be as fast as simply reading the PDF.
Edit: based on your comment, I am including links to the following examples.
These examples might seem lengthy, but they are complete, in the sense that they first create the PDF and then extract content or information from it.
Upvotes: 0
Reputation: 11
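Try this:
import PyPDF2

colored_page_count = 0
# The with block closes the file even if reading fails.
with open('nama_file.pdf', 'rb') as pdf_file:
    # PdfReader replaces the deprecated PdfFileReader in recent PyPDF2 releases.
    pdf_reader = PyPDF2.PdfReader(pdf_file)
    for page in pdf_reader.pages:
        # Caveat: colour spaces are normally listed under the page's /Resources
        # entry, so this top-level lookup may not match on many PDFs.
        if page.get("/ColorSpace") == "/DeviceRGB":
            colored_page_count += 1
print(colored_page_count)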
Upvotes: 0