Reputation: 1090
I have a multi-page PDF file containing information I need to parse. Each set of information and its picture is confined to its own page. I need to extract the text and the image from each page of the PDF.
I'm using CentOS and PHP.
My attempt:
I originally tried a combination of pdftotext and ImageMagick. Converting the PDF into images did separate the pages into their own image files, but the quality of the picture on each page came out very poor.
My goal:
I need to split the PDF into multiple PDFs, one per page. Then, I need to extract the image from that page with the best quality possible.
Thanks.
Upvotes: 0
Views: 1398
Reputation: 1291
pdfseparate multi-page.pdf ./single-pages/%d.pdf
(%d is the variable for the page number)
mogrify -density 300 -format png ./single-pages/*.pdf
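As a rough sketch of how the splitting step could be scripted and combined with per-page text extraction: the function names `page_pattern` and `split_and_extract_text` are made up for illustration, and the script assumes the poppler-utils tools `pdfseparate` and `pdftotext` are installed. The output directory is created first, since `pdfseparate` is not expected to create it.

```shell
# Sketch only: split a PDF into one file per page, then extract each page's text.
# Assumes poppler-utils (pdfseparate, pdftotext) is available on the PATH.

# Hypothetical helper: build the per-page output pattern for a directory.
# The literal %d is what pdfseparate replaces with the page number.
page_pattern() {
    printf '%s/%%d.pdf\n' "$1"
}

# Hypothetical helper: split $1 into $2/<page>.pdf and write $2/<page>.txt.
split_and_extract_text() {
    input=$1
    outdir=$2
    mkdir -p "$outdir" || return 1
    pdfseparate "$input" "$(page_pattern "$outdir")" || return 1
    for page in "$outdir"/*.pdf; do
        # Skip the unexpanded glob if no pages were produced.
        [ -e "$page" ] || continue
        # -layout tries to preserve the physical layout of the text.
        pdftotext -layout "$page" "${page%.pdf}.txt"
    done
}

# Usage (nothing runs until the function is called):
#   split_and_extract_text multi-page.pdf ./single-pages
```

From PHP, a script like this could be invoked through `exec()` or `shell_exec()`, with the file names passed through `escapeshellarg()`.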
Upvotes: -1
Reputation: 2765
ImageMagick is not the right tool for this task.
When you need to extract images from a PDF at their original size (which is the best quality available, since any other resolution is either lower or higher than the original), you must use
pdfimages
http://www.foolabs.com/xpdf/download.html
(static binaries are available if you cannot compile from source)
syntax:
pdfimages file.pdf image-root
The resulting images will have the extension .ppm (or .pbm for monochrome images) unless you add the -j switch, which saves images stored in JPEG (DCT) format as .jpg files.
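Since each picture lives on its own page, one possible way to keep the page-to-image mapping is to run pdfimages once per page with its `-f` (first page) and `-l` (last page) options, without splitting the PDF first. The sketch below is only an illustration: the function names `image_root` and `extract_page_images` and the `page-N` naming are assumptions, and it relies on `pdfinfo` (shipped alongside pdfimages) printing a `Pages:` line.

```shell
# Sketch only: extract embedded images page by page at their original size.
# Assumes pdfimages and pdfinfo (xpdf or poppler-utils) are on the PATH.

# Hypothetical helper: image root for page N, e.g. ./images/page-3
image_root() {
    printf '%s/page-%s\n' "$1" "$2"
}

# Hypothetical helper: write the images of every page of $1 into directory $2.
extract_page_images() {
    input=$1
    outdir=$2
    mkdir -p "$outdir" || return 1
    # pdfinfo prints a line such as "Pages:   12"; take the number.
    pages=$(pdfinfo "$input" | awk '/^Pages:/ {print $2}')
    [ -n "$pages" ] || return 1
    n=1
    while [ "$n" -le "$pages" ]; do
        # -j keeps JPEG-encoded images as .jpg; others are written as .ppm/.pbm.
        pdfimages -j -f "$n" -l "$n" "$input" "$(image_root "$outdir" "$n")"
        n=$((n + 1))
    done
}

# Usage (nothing runs until the function is called):
#   extract_page_images file.pdf ./images
```

pdfimages appends its own image counter and extension to the root, so the files for page 3 would start with `page-3`.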
Upvotes: 2