Abdel Hana

Reputation: 91

Extract an image from a PDF in python

I'm trying to extract images from a PDF using PyPDF2, but the image my code writes out looks very different from how it appears in the PDF. This is what I get:

[image]

But this is how it should really look:

[image]

Here's the pdf I'm using:

https://www.hbp.com/resources/SAMPLE%20PDF.pdf

Here's my code:

import PyPDF2  # written against the pre-3.0 API (PdfFileReader, getPage, getObject)

pdf_filename = "SAMPLE.pdf"
pdf_file = open(pdf_filename, 'rb')
cond_scan_reader = PyPDF2.PdfFileReader(pdf_file)
page = cond_scan_reader.getPage(0)

xObject = page['/Resources']['/XObject'].getObject()
i = 0
for obj in xObject:
    # print(xObject[obj])
    if xObject[obj]['/Subtype'] == '/Image':
        # /DCTDecode streams hold JPEG data, so the raw bytes are written out as-is
        if xObject[obj]['/Filter'] == '/DCTDecode':
            data = xObject[obj]._data
            with open("{}.jpg".format(i), "wb") as img:
                img.write(data)
            i += 1

Since I need to keep the image in its colour mode, I can't just convert it to RGB if it is CMYK, because I need that information. Also, I'm trying to get the DPI of the images I extract from a PDF. Is that information always stored in the image? Thanks in advance.

Upvotes: 4

Views: 3851

Answers (2)

Maksym Polshcha

Reputation: 18368

I used pdfreader to extract the image from your example. The image uses an ICCBased colorspace with N=4 and an Intent of RelativeColorimetric. This means that the "closest" PDF colorspace is DeviceCMYK.
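
The four-component observation can also be cross-checked on the raw /DCTDecode bytes that the question's code writes to disk, without any PDF or imaging library. What follows is a minimal sketch, assuming a well-formed JPEG stream whose frame header precedes the first scan. `jpeg_info` is a hypothetical helper name, and the `sample` bytes are a hand-built header used only for illustration, not data taken from the PDF in question.

```python
import struct

def jpeg_info(data: bytes) -> dict:
    """Report the component count and Adobe APP14 transform flag of a JPEG stream."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG stream (missing SOI marker)")
    info = {"components": None, "adobe_transform": None}
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError("lost sync with JPEG markers")
        marker = data[pos + 1]
        if marker == 0xFF:            # fill byte before a marker
            pos += 1
            continue
        if marker in (0xD9, 0xDA):    # end of image / start of scan: stop looking
            break
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        body = data[pos + 4:pos + 2 + length]
        if marker == 0xEE and body[:5] == b"Adobe" and len(body) >= 12:
            info["adobe_transform"] = body[11]   # 0 = none, 1 = YCbCr, 2 = YCCK
        elif 0xC0 <= marker <= 0xCF and marker not in (0xC4, 0xC8, 0xCC):
            info["components"] = body[5]         # byte after precision, height, width
            break
        pos += 2 + length
    return info

# Hand-built header (assumption, for illustration): SOI, Adobe APP14, SOF0 with 4 components.
sample = (
    b"\xff\xd8"
    + b"\xff\xee" + struct.pack(">H", 14) + b"Adobe" + struct.pack(">HHHB", 100, 0, 0, 2)
    + b"\xff\xc0" + struct.pack(">HBHHB", 20, 8, 1, 1, 4) + bytes(12)
)
print(jpeg_info(sample))  # {'components': 4, 'adobe_transform': 2}
```

In practice the function would be called on `data` from the question's loop. A count of 4 points to CMYK or YCCK samples, and Adobe-written CMYK JPEGs commonly store inverted sample values, which would be consistent with the colours looking wrong.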

All you need is to convert the image to RGB and invert the colors.

Here is the code:

from pdfreader import SimplePDFViewer
import PIL.ImageOps 

fd = open("SAMPLE PDF.pdf", "rb")
viewer = SimplePDFViewer(fd)

viewer.render()
img = viewer.canvas.images['Im0']

# this displays ICCBased 4 RelativeColorimetric
print(img.ColorSpace[0], img.ColorSpace[1].N, img.Intent)

pil_image = img.to_Pillow()
pil_image = pil_image.convert("RGB")
inverted = PIL.ImageOps.invert(pil_image)

inverted.save("sample.png")
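
Since the question asks to keep the original colour mode, an alternative that may be worth trying is to invert the CMYK samples directly instead of converting to RGB first. The sketch below is pure Python and only illustrates the arithmetic. It assumes 8-bit interleaved CMYK samples that are stored inverted (which the fix above suggests, but which remains an assumption), the helper names `invert_samples` and `naive_cmyk_to_rgb` are hypothetical, and the preview formula ignores the ICC profile.

```python
def invert_samples(raw: bytes) -> bytes:
    """Invert every 8-bit sample; channel count and order stay unchanged."""
    return bytes(255 - b for b in raw)

def naive_cmyk_to_rgb(c, m, y, k):
    """Naive device formula for a quick preview only; no ICC profile is applied."""
    return tuple(round((255 - ch) * (255 - k) / 255) for ch in (c, m, y))

# Hypothetical decoded pixel data: two CMYK pixels, 4 bytes each.
stored = bytes([255, 255, 255, 255,   # inverted "no ink" (paper white)
                255, 255, 255, 0])    # inverted "black ink only"

cmyk = invert_samples(stored)         # still CMYK, now holding true ink values
pixels = [tuple(cmyk[i:i + 4]) for i in range(0, len(cmyk), 4)]

print(pixels)                                   # [(0, 0, 0, 0), (0, 0, 0, 255)]
print([naive_cmyk_to_rgb(*p) for p in pixels])  # [(255, 255, 255), (0, 0, 0)]
```

The data stays in CMYK throughout; the RGB tuples are computed only to check that white and black come out as expected.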

Read more on PDF objects: Image (sec. 8.9.5), InlineImage (sec. 8.9.7)

Upvotes: 1

Bill Chen

Reputation: 1749

You probably need to use another library such as Pillow to do the conversion.

Here is an example:

    from PIL import Image
    image = Image.open("path_to_image")
    if image.mode == 'CMYK':
        image = image.convert('RGB')
    image.save("path_to_image.jpg")  # Pillow images are written with save(), not write()

Reference: Convert from CMYK to RGB

Upvotes: 1
