Adam

Reputation: 1090

Converting a multi-page PDF into single-page PDFs and extracting images

I have a multi-page PDF file containing information I need to parse. Each page holds its own text and picture. I need to extract both the text and the image from each page.

I'm using CentOS and PHP.

My attempt:

I originally tried a combination of pdftotext and ImageMagick. Converting the PDF to images did split the pages into separate image files, but the quality of the picture on each page came out very poor.

My goal:

I need to split the PDF into multiple PDFs, one per page. Then, I need to extract the image from that page with the best quality possible.

Thanks.

Upvotes: 0

Views: 1398

Answers (2)

porg

Reputation: 1291

Use pdfseparate to split multi-page.pdf into 1.pdf, 2.pdf, … and then convert 1.pdf to 1.png, …

Use pdfseparate (part of poppler) to split multi-page.pdf into 1.pdf, 2.pdf, …

pdfseparate multi-page.pdf ./single-pages/%d.pdf
  • extracts all pages from multi-page.pdf
  • and saves them as single-page PDFs (%d is replaced by the page number)
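The split step can be wrapped in a small shell function. This is only a sketch: split_pdf and the DRY_RUN switch are illustrative names, not part of poppler, and the mkdir step assumes pdfseparate will not create the output directory on its own.

```shell
#!/bin/sh
# Sketch: split a PDF into one file per page with pdfseparate (poppler).
# split_pdf and DRY_RUN are hypothetical names chosen for this example.
split_pdf() {
    input=$1
    outdir=$2
    # With DRY_RUN=1 the commands are printed instead of executed.
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "mkdir -p $outdir"
        echo "pdfseparate $input $outdir/%d.pdf"
        return 0
    fi
    # Assumption: the output directory must exist before pdfseparate runs.
    mkdir -p "$outdir" || return 1
    # %d is replaced by the page number: 1.pdf, 2.pdf, ...
    pdfseparate "$input" "$outdir/%d.pdf"
}
```

Calling split_pdf multi-page.pdf ./single-pages should leave one PDF per page in ./single-pages.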

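Use mogrify (part of ImageMagick) to batch-convert all single-page PDFs to PNGs at your desired resolution (in DPI). The -density setting has to come before the input files so that it applies when the PDFs are rasterized:

mogrify -density 300 -format png ./single-pages/*.pdf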

Upvotes: -1

Dingo

Reputation: 2765

ImageMagick is not the right tool for this task.

When you need to extract images from a PDF at their original size (which is the best you can get, since any other resolution is either lower or higher than the original), you should use

pdfimages

http://www.foolabs.com/xpdf/download.html

(static binaries are available if you cannot compile from source)

syntax:

pdfimages file.pdf image-root

The resulting images will have the extension .ppm (or .pbm for monochrome images), unless you add the -j switch, which writes images that are stored as JPEG in the PDF as .jpg files.
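If the PDF has already been split into single-page files, pdfimages can be run once per page so that each output file carries its page number. A minimal sketch: extract_page_images, DRY_RUN, and the directory names are illustrative and not part of xpdf or poppler.

```shell
#!/bin/sh
# Sketch: extract the embedded images of one single-page PDF with pdfimages.
# extract_page_images and DRY_RUN are hypothetical names chosen for this example.
extract_page_images() {
    page=$1
    imgdir=$2
    # Image root: output directory plus the page file name without ".pdf".
    root=$imgdir/$(basename "$page" .pdf)
    # With DRY_RUN=1 the command is printed instead of executed.
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "pdfimages -j $page $root"
        return 0
    fi
    mkdir -p "$imgdir" || return 1
    # -j saves JPEG-encoded images as .jpg; other images stay .ppm/.pbm.
    pdfimages -j "$page" "$root"
}
```

A loop such as for p in ./single-pages/*.pdf; do extract_page_images "$p" ./images; done would then process every page, writing files named after the image root followed by an image counter (for example ./images/3-000.jpg for page 3).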

Upvotes: 2
