Amandasaurus

Reputation: 60689

Given a PDF, how to extract the images *and their locations on the page* from the command line?

I have a PDF which includes text and images. I want to extract the images from the PDF using the Linux command line. I can use pdfimages to extract the images, but I also want to find where on each page each image is placed. pdfimages can tell me which page each image is on (from the filename), but that's all it gives me. Is there any other FLOSS tool that can do this?

Upvotes: 16

Views: 9543

Answers (3)

Eric Fortis

Reputation: 17350

The PDF must contain the information for placing the images, so this should be possible to read out directly. Alternatively, one solution is to locate the images by matching them against a rendering of the page:

  1. Convert each pdf page to an image with pdftoppm
  2. Extract the images from each page with pdfimages
  3. Convert the images to a single 8-bit grey-scale channel (for faster analysis) with cvCvtColor
  4. Object detection with matchTemplate

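Step 2 may look similar to this (pdfimages page numbers start at 1, and -f and -l are set to the same page so each run extracts exactly one page):

for i in {1..100} ; do pdfimages -f "$i" -l "$i" file.pdf "page$i"; done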

For Step 3, here* is a simple example.

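Step 4 needs no training, because the extracted image should be an exact match for a region of the rendered page. Pass the page as the image to search in and the extracted image as the template: matchTemplate( pdfPageImg, extractedImg, outputMap, CV_TM_SQDIFF ). With CV_TM_SQDIFF the best match is the minimum of outputMap.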
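To make Step 4 concrete, here is a rough sketch of what a squared-difference template match computes. It is not OpenCV's implementation, just a plain-Python illustration on tiny grey-scale grids; the function name sqdiff_match and the sample data are made up for this example. For real pages you would want OpenCV's matchTemplate, which does the same search far faster.

```python
def sqdiff_match(page, templ):
    """Find the best squared-difference match of templ inside page.

    page and templ are 2-D lists (rows) of grey values.
    Returns (x, y, score) for the top-left corner of the best match,
    or None if the template is larger than the page.
    """
    page_h, page_w = len(page), len(page[0])
    templ_h, templ_w = len(templ), len(templ[0])
    best = None
    # Slide the template over every position where it fits entirely.
    for y in range(page_h - templ_h + 1):
        for x in range(page_w - templ_w + 1):
            # Sum of squared differences; 0 means an exact match.
            score = sum(
                (page[y + j][x + i] - templ[j][i]) ** 2
                for j in range(templ_h)
                for i in range(templ_w)
            )
            if best is None or score < best[2]:
                best = (x, y, score)
    return best


# Hypothetical 5x4 "page" with a 2x2 "image" placed at x=2, y=1.
page = [
    [0, 0, 0, 0, 0],
    [0, 0, 9, 8, 0],
    [0, 0, 7, 6, 0],
    [0, 0, 0, 0, 0],
]
templ = [[9, 8], [7, 6]]

print(sqdiff_match(page, templ))  # (2, 1, 0)
```

Note that an exact match (score 0) is only likely if the page is rendered at the same pixel scale as the extracted image; otherwise the extracted image would need resizing first.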

(* - link removed as it now appears to be pointing towards a ransomware site)

Upvotes: 17

someuser9809

Reputation: 111

There's an -xml switch for the pdftohtml command which will give image position, dimension and source information.

pdftohtml -xml file.pdf
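If you then want the positions in a script, something like the following sketch could pull them out of the XML. It assumes the output has page elements with a number attribute and image elements with top, left, width, height and src attributes, which is what I'd expect from poppler's pdftohtml, but check your own output. The SAMPLE string and the function name image_positions are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical excerpt in the shape pdftohtml -xml is assumed to produce.
SAMPLE = """
<pdf2xml>
  <page number="1" position="absolute" top="0" left="0" height="1188" width="918">
    <image top="100" left="50" width="200" height="150" src="file-1_1.png"/>
    <text top="300" left="50" width="80" height="12" font="0">Hello</text>
  </page>
</pdf2xml>
"""


def image_positions(xml_text):
    """Return one dict per image element: page number, source file, position, size."""
    root = ET.fromstring(xml_text)
    found = []
    for page in root.iter("page"):
        for img in page.iter("image"):
            found.append({
                "page": int(page.get("number")),
                "src": img.get("src"),
                # Coordinates are in pdftohtml's page units, measured from the top-left.
                "left": float(img.get("left")),
                "top": float(img.get("top")),
                "width": float(img.get("width")),
                "height": float(img.get("height")),
            })
    return found


for entry in image_positions(SAMPLE):
    print(entry)
```

For a real file, parsing the generated XML file directly with ET.parse(path).getroot() instead of ET.fromstring is probably the easier route.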

Upvotes: 11

mark stephens

Reputation: 3184

PDF does not guarantee that a reused image is stored only once; each use may appear as a separate image. There is also very little image metadata in a PDF file beyond the image's location on the page and its actual size on the page. I wrote an article explaining how images are stored inside a PDF at http://www.jpedal.org/PDFblog/2010/09/understanding-the-pdf-file-format-images/

Upvotes: 6
