Reputation: 3147

Python - Extract a PDF page as a jpeg

How can I efficiently save a particular page of a PDF as a jpeg file using Python?

I have a Python Flask web server where PDFs will be uploaded and I want to also store jpeg files that correspond to each PDF page.

This solution is close but it does not result in the entire page being converted to a jpeg.

Upvotes: 205

Answers (19)

mara004

Reputation: 2349

Using pypdfium2 (v4):

python3 -m pip install "pypdfium2==4" pillow

import pypdfium2 as pdfium

# Load a document
pdf = pdfium.PdfDocument("tests/resources/multipage.pdf")

# Loop over pages and render
for i in range(len(pdf)):
    page = pdf[i]
    image = page.render(scale=4).to_pil()
    image.save(f"output_{i:03d}.jpg")

Advantages:

PDFium is liberal-licensed (BSD 3-Clause, Apache 2.0, plus other liberal licenses of pdfium's third-party dependencies) [^1]
It is fast, outperforming Poppler. In terms of speed, pypdfium2 can almost reach PyMuPDF
Returns PIL.Image.Image, numpy.ndarray, or a ctypes array, depending on your needs
Is capable of processing encrypted (password-protected) PDFs
No mandatory runtime dependencies
Supports Python >= 3.6
Setup infrastructure complies with PEP 517/518

Wheels are currently available for:

Windows amd64, win32, arm64
macOS x86_64, arm64
Linux (glibc) x86_64, i686, aarch64, armv7l
Linux (musl) x86_64, i686, aarch64

There is a script to build from source, too.

Disclaimer: I'm the maintainer

[^1]: Note, this is not legal advice, and there is absolutely no warranty.

Upvotes: 54

Keval Dave

Reputation: 2987

The pdf2image library can be used.

You can install it simply using,

pip install pdf2image

Once installed you can use following code to get images.

from pdf2image import convert_from_path
pages = convert_from_path('pdf_file', 500)

Saving pages in jpeg format

for count, page in enumerate(pages):
    page.save(f'out{count}.jpg', 'JPEG')

Edit: the Github repo pdf2image also mentions that it uses pdftoppm and that it requires other installations:

pdftoppm is the piece of software that does the actual magic. It is distributed as part of a greater package called poppler. Windows users will have to install [poppler for Windows] see ** below Mac users will have to install poppler for Mac. Linux users will have pdftoppm pre-installed with the distro (Tested on Ubuntu and Archlinux) if it's not, run sudo apt install poppler-utils.

You can install the latest version under Windows using anaconda by doing:

conda install -c conda-forge poppler

** note: Windows 64 bit versions upto 24.08 are available at https://github.com/oschwartz10612/poppler-windows but note that for 32 bit 22.02 was the last one included in TeXLive 2022 (https://poppler.freedesktop.org/releases.html) so you'll not be getting the latest features or bug fixes.

Upvotes: 239

photek1944

Reputation: 139

Install poppler for Windows and use pdftoppm.exe as follows:

Download zip file with Poppler's latest binaries/dlls from http://blog.alivate.com.au/poppler-windows/ and unzip to a new folder in your program files folder. For example: "C:\Program Files (x86)\Poppler".
Add "C:\Program Files (x86)\Poppler\poppler-0.68.0\bin" to your SYSTEM PATH environment variable.
From cmd line install pdf2image module -> "pip install pdf2image".
Or alternatively, directly execute pdftoppm.exe from your code using Python's subprocess module as explained by user Basj.

This code should generate the jpgs you want through the subprocess module for all pages of one or more pdfs in a given folder:

import os, subprocess

pdf_dir = r"C:\yourPDFfolder"
os.chdir(pdf_dir)

pdftoppm_path = r"C:\Program Files (x86)\Poppler\poppler-0.68.0\bin\pdftoppm.exe"

for pdf_file in os.listdir(pdf_dir):
    
    if pdf_file.endswith(".pdf"):

        subprocess.Popen('"%s" -jpeg %s out' % (pdftoppm_path, pdf_file))

Or using the pdf2image module:

import os
from pdf2image import convert_from_path
        
pdf_dir = r"C:\yourPDFfolder"
os.chdir(pdf_dir)
        
    for pdf_file in os.listdir(pdf_dir):
        
        if pdf_file.endswith(".pdf"):
    
            pages = convert_from_path(pdf_file, 300)
            pdf_file = pdf_file[:-4]
    
            for page in pages:

               page.save("%s-page%d.jpg" % (pdf_file,pages.index(page)), "JPEG")

Upvotes: 13

dpacman

Reputation: 3897

Here is a function that does the conversion of a PDF file with one or multiple pages to a single merged JPEG image.

import os
import tempfile
from pdf2image import convert_from_path
from PIL import Image

def convert_pdf_to_image(file_path, output_path):
    # save temp image files in temp dir, delete them after we are finished
    with tempfile.TemporaryDirectory() as temp_dir:
        # convert pdf to multiple image
        images = convert_from_path(file_path, output_folder=temp_dir)
        # save images to temporary directory
        temp_images = []
        for i in range(len(images)):
            image_path = f'{temp_dir}/{i}.jpg'
            images[i].save(image_path, 'JPEG')
            temp_images.append(image_path)
        # read images into pillow.Image
        imgs = list(map(Image.open, temp_images))
    # find minimum width of images
    min_img_width = min(i.width for i in imgs)
    # find total height of all images
    total_height = 0
    for i, img in enumerate(imgs):
        total_height += imgs[i].height
    # create new image object with width and total height
    merged_image = Image.new(imgs[0].mode, (min_img_width, total_height))
    # paste images together one by one
    y = 0
    for img in imgs:
        merged_image.paste(img, (0, y))
        y += img.height
    # save merged image
    merged_image.save(output_path)
    return output_path

Example usage:

convert_pdf_to_image("path_to_Pdf/1.pdf", "output_path/output.jpeg")

Upvotes: 4

Christopher Creveling

Reputation: 351

I wrote this script to easily convert a folder directory that contains PDFs (single page) to PNGs really nicely.

import os
from pathlib import PurePath
import glob
# from PIL import Image
from pdf2image import convert_from_path
import pdb

# In[file list]

wd = os.getcwd()

# filter images
fileListpdf = glob.glob(f'{wd}//*.pdf')

# In[Convert pdf to images]

for i in fileListpdf:
    
    images = convert_from_path(i, dpi=300)
    
    path_split = PurePath(i).parts
    fileName, ext = os.path.splitext(path_split[-1])
    
    images[0].save(f'{fileName}.png', 'PNG')

Upvotes: 2

VaChinaka

Reputation: 9

For windows this was the easy way to install poppler for pdf2image

install pdf2image

pip install pdf2image

install chocolatey using this command in cmd run cmd as administrator

@"%SystemRoot%\System32\WindowsPowerShell\v1.0\powershell.exe" -NoProfile -InputFormat None -ExecutionPolicy Bypass -Command "[System.Net.ServicePointManager]::SecurityProtocol = 3072; iex ((New-Object System.Net.WebClient).DownloadString('https://community.chocolatey.org/install.ps1'))" && SET "PATH=%PATH%;%ALLUSERSPROFILE%\chocolatey\bin"

install poppler

choco install poppler

when its done take the installation path displayed and add to environment variables path

it will look like this

'C:\ProgramData\chocolatey\lib\poppler\tools'

i just added poppler dir as insurance policy

'C:\ProgramData\chocolatey\lib\poppler'

Upvotes: -1

Gustavo Lisi

Reputation: 141

Following pdf2image documentation in 2024. Just remember to install poppler

convert_from_path returns a list with all the pages of the pdf converted to .ppm, then define the file name and save the first page defined in image_list[0] as JPEG. If you want to save all pages, just iterate over image_list

import os
from pdf2image import convert_from_path

pdf_folder = 'path/to/pdfs'
img_folder = 'path/to/save/imgs'

for file in os.listdir(pdf_folder):
    if file.endswith('.pdf'):
        pdf_path = os.path.join(pdf_folder, file)

        with open(pdf_path, 'rb') as pdf_arquivo:
            name = os.path.splitext(file)[0]            
            image_list = convert_from_path(pdf_path, poppler_path='C:/Poppler/bin')
            img_path = os.path.join(img_folder, f'{name}.jpg')
            image_list[0].save(img_path, 'JPEG')

print("Finished!")

Upvotes: 0

JJPty

Reputation: 1627

I found this simple solution, PyMuPDF, output to png file. Note the library is imported as "fitz", a historical name for the rendering engine it uses.

import fitz

pdffile = "infile.pdf"
doc = fitz.open(pdffile)
page = doc.load_page(0)  # number of page
pix = page.get_pixmap()
output = "outfile.png"
pix.save(output)
doc.close()

Note: The library changed from using "camelCase" to "snake_cased". If you run into an error that a function does not exist, have a look under deprecated names. The functions in the example above have been updated accordingly.

The fitz.Document class supports a context manager initialization:

with fitz.open(pdffile) as doc:
   ...

Upvotes: 161

DevB2F

Reputation: 5095

There is no need to install Poppler on your OS. This will work:

pip install Wand

from wand.image import Image

f = "somefile.pdf"
with(Image(filename=f, resolution=120)) as source: 
    for i, image in enumerate(source.sequence):
        newfilename = f.removesuffix(".pdf") + str(i + 1) + '.jpeg'
        Image(image).save(filename=newfilename)

Upvotes: 17

Rajkumar

Reputation: 568

One problem everyone will face that is to Install Poppler. My way is a tricky way,but will work efficiently.

1st download Poppler here.

Then extract it and in the code section just add poppler_path=r'C:\Program Files\poppler-0.68.0\bin' (for eg.) like below

from pdf2image import convert_from_path
images = convert_from_path("mypdf.pdf", 500,poppler_path=r'C:\Program Files\poppler-0.68.0\bin')
for i, image in enumerate(images):
    fname = 'image'+str(i)+'.png'
    image.save(fname, "PNG")

Upvotes: 4

Malki Mohamed

Reputation: 1688

This easy script can convert a folder directory that contains PDFs (single/multiple pages) to jpeg.

from PIL import Image
import pytesseract
import sys
from pdf2image import convert_from_path
import os
from os import listdir
from os import system
from os.path import isfile, join, basename, dirname
import shutil

def move_processed_file(file, doc_path, download_processed):
    try:
        shutil.move(doc_path + '/' + file, download_processed + '/' + file)
        pass
    except Exception as e:
        print(e.errno)
        raise
    else:
        pass
    finally:
        pass
    pass


def run_conversion():
    root_dir = os.path.abspath(os.curdir)

    doc_path = root_dir + r"\data\download"
    pdf_processed = root_dir + r"\data\download\pdf_processed"
    results_folder = doc_path

    files = [f for f in listdir(doc_path) if isfile(join(doc_path, f))]

    pdf_files = [f for f in listdir(doc_path) if isfile(join(doc_path, f)) and f.lower().endswith('.pdf')]

    # check OS type
    if os.name == 'nt':
        # if is windows or a graphical OS, change this poppler path with your own path
        poppler_path = r"C:\poppler-0.68.0\bin"
    else:
        poppler_path = root_dir + r"\usr\bin"

    for file in pdf_files:

        ''' 
        # Converting PDF to images 
        '''

        # Store all the pages of the PDF in a variable
        pages = convert_from_path(doc_path + '/' + file, 500, poppler_path=poppler_path)

        # Counter to store images of each page of PDF to image
        image_counter = 1

        filename, file_extension = os.path.splitext(file)

        # Iterate through all the pages stored above
        for page in pages:
            # Declaring filename for each page of PDF as JPG
            # PDF page n -> page_n.jpg
            filename = filename + '_' + str(image_counter) + ".jpg"

            # Save the image of the page in system
            page.save(results_folder + '/' + filename, 'JPEG')

            # Increment the counter to update filename
            image_counter += 1

        move_processed_file(file, doc_path, pdf_processed)

Upvotes: 0

Basj

Reputation: 46323

The Python library pdf2image (used in the other answer) in fact doesn't do much more than just launching pdttoppm with subprocess.Popen, so here is a short version doing it directly:

PDFTOPPMPATH = r"D:\Documents\software\____PORTABLE\poppler-0.51\bin\pdftoppm.exe"
PDFFILE = "SKM_28718052212190.pdf"

import subprocess
subprocess.Popen('"%s" -png "%s" out' % (PDFTOPPMPATH, PDFFILE))

Here is the Windows installation link for pdftoppm (contained in a package named poppler): http://blog.alivate.com.au/poppler-windows/.

Upvotes: 31

madan maram

Reputation: 23

from pdf2image import convert_from_path

PDF_file = 'Statement.pdf'
pages = convert_from_path(PDF_file, 500,userpw='XXX')

image_counter = 1

for page in pages:

    filename = "foldername/page_" + str(image_counter) + ".jpg"
    page.save(filename, 'JPEG')
    image_counter = image_counter + 1

Upvotes: -3

SKG

Reputation: 163

For a pdf file with multiple pages, the following is the best & simplest (I used pdf2image-1.14.0):

from pdf2image import convert_from_path
from pdf2image.exceptions import (
     PDFInfoNotInstalledError,
     PDFPageCountError,
     PDFSyntaxError
     )
        
images = convert_from_path(r"path/to/input/pdf/file", output_folder=r"path/to/output/folder", fmt="jpg",) #dpi=200, grayscale=True, size=(300,400), first_page=0, last_page=3)
        
images.clear()

Note:

"images" is a list of PIL images.
The saved images in the output folder will have system generated names; one can later change them, if required.

Upvotes: 0

moo5e

Reputation: 63

Here is a solution which requires no additional libraries and is very fast. This was found from: https://nedbatchelder.com/blog/200712/extracting_jpgs_from_pdfs.html# I have added the code in a function to make it more convenient.

def convert(filepath):
    with open(filepath, "rb") as file:
        pdf = file.read()

    startmark = b"\xff\xd8"
    startfix = 0
    endmark = b"\xff\xd9"
    endfix = 2
    i = 0

    njpg = 0
    while True:
        istream = pdf.find(b"stream", i)
        if istream < 0:
            break
        istart = pdf.find(startmark, istream, istream + 20)
        if istart < 0:
            i = istream + 20
            continue
        iend = pdf.find(b"endstream", istart)
        if iend < 0:
            raise Exception("Didn't find end of stream!")
        iend = pdf.find(endmark, iend - 20)
        if iend < 0:
            raise Exception("Didn't find end of JPG!")

        istart += startfix
        iend += endfix
        jpg = pdf[istart:iend]
        newfile = "{}jpg".format(filepath[:-3])
        with open(newfile, "wb") as jpgfile:
            jpgfile.write(jpg)

        njpg += 1
        i = iend

        return newfile

Call convert with the pdf path as the argument and the function will create a .jpg file in the same directory

Upvotes: -1

Keval Dave

Reputation: 2987

GhostScript performs much faster than Poppler for a Linux based system.

Following is the code for pdf to image conversion.

def get_image_page(pdf_file, out_file, page_num):
    page = str(page_num + 1)
    command = ["gs", "-q", "-dNOPAUSE", "-dBATCH", "-sDEVICE=png16m", "-r" + str(RESOLUTION), "-dPDFFitPage",
               "-sOutputFile=" + out_file, "-dFirstPage=" + page, "-dLastPage=" + page,
               pdf_file]
    f_null = open(os.devnull, 'w')
    subprocess.call(command, stdout=f_null, stderr=subprocess.STDOUT)

GhostScript can be installed on macOS using brew install ghostscript

Installation information for other platforms can be found here. If it is not already installed on your system.

Upvotes: 8

Saiprasad Bhatwadekar

Reputation: 11

from pdf2image import convert_from_path
import glob

pdf_dir = glob.glob(r'G:\personal\pdf\*')  #your pdf folder path
img_dir = "G:\\personal\\img\\"           #your dest img path

for pdf_ in pdf_dir:
    pages = convert_from_path(pdf_, 500)
    for page in pages:
        page.save(img_dir+pdf_.split("\\")[-1][:-3]+"jpg", 'JPEG')

Upvotes: 0

Robert

Reputation: 11

I use a (maybe) much simpler option of pdf2image:

cd $dir
for f in *.pdf
do
  if [ -f "${f}" ]; then
    n=$(echo "$f" | cut -f1 -d'.')
    pdftoppm -scale-to 1440 -png $f $conv/$n
    rm $f
    mv  $conv/*.png $dir
  fi
done

This is a small part of a bash script in a loop for the use of a narrow casting device. Checks every 5 seconds on added pdf files (all) and processes them. This is for a demo device, at the end converting will be done at a remote server. Converting to .PNG now, but .JPG is possible too.

This converting, together with transitions on A4 format, displaying a video, two smooth scrolling texts and a logo (with transition in three versions) sets the Pi3 to allmost 4x 100% cpu-load ;-)

Upvotes: 1

duck

Reputation: 2561

Their is a utility called pdftojpg which can be used to convert the pdf to img

You can found the code here https://github.com/pankajr141/pdf2jpg

from pdf2jpg import pdf2jpg
inputpath = r"D:\inputdir\pdf1.pdf"
outputpath = r"D:\outputdir"
# To convert single page
result = pdf2jpg.convert_pdf2jpg(inputpath, outputpath, pages="1")
print(result)

# To convert multiple pages
result = pdf2jpg.convert_pdf2jpg(inputpath, outputpath, pages="1,0,3")
print(result)

# to convert all pages
result = pdf2jpg.convert_pdf2jpg(inputpath, outputpath, pages="ALL")
print(result)

Upvotes: 5

Python - Extract a PDF page as a jpeg

Answers (19)

Related Questions