Reputation: 60689
I have a PDF which includes text and images. I want to extract the images from the PDF using the Linux command line. I can use pdfimages to extract the images, but I also want to find where on each page each image is located. pdfimages can tell me which page each image is on (from the filename), but that's all it gives me. Is there any other FLOSS tool that can do this?
Upvotes: 16
Views: 9543
Reputation: 17350
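Well, I think the PDF must contain the information for placing them, so this should be possible. Alternatively, a workaround could be, for example:
1. pdftoppm: render each page to an image
2. pdfimages: extract the embedded images
3. cvCvtColor: convert the page render and the extracted images to a common color space (e.g. grayscale)
4. matchTemplate: locate each extracted image within the rendered page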
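Step 1 may look similar to Step 2, which can be run one page at a time (pdfimages numbers pages from 1, and -f and -l are set to the same page so that each run covers a single page):
for i in {1..100}; do pdfimages -f "$i" -l "$i" file.pdf "page$i"; done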
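Step 3: here* is a simple example.
In Step 4 you should not have problems with training, because the image will be an exact match, provided the page is rendered at a resolution at which the image appears at its native pixel size: matchTemplate(pdfPageImg, imageToSearch, outputMap, CV_TM_SQDIFF). The page to search in comes first and the extracted image second; with CV_TM_SQDIFF the best match is the minimum of outputMap.
(* - link removed as it now appears to be pointing towards a ransomware site)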
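The matching in Step 4 can be illustrated without OpenCV. The following is a minimal pure-Python sketch of squared-difference template matching, the score that CV_TM_SQDIFF is based on; it is not the OpenCV implementation, and the function name match_sqdiff and the toy pixel data are made up for illustration. It assumes both inputs are already grayscale and at the same scale.

```python
def match_sqdiff(page, templ):
    """Return (x, y, score) of the best squared-difference match.

    page and templ are 2D lists (rows) of grayscale pixel values.
    A score of 0 means an exact match. Returns None if templ does
    not fit inside page.
    """
    page_h, page_w = len(page), len(page[0])
    t_h, t_w = len(templ), len(templ[0])
    best = None
    # Slide the template over every position where it fits entirely.
    for y in range(page_h - t_h + 1):
        for x in range(page_w - t_w + 1):
            score = sum(
                (page[y + dy][x + dx] - templ[dy][dx]) ** 2
                for dy in range(t_h)
                for dx in range(t_w)
            )
            # Lower is better for squared differences.
            if best is None or score < best[2]:
                best = (x, y, score)
    return best


# Hypothetical toy data: a 4x5 "page" with a 2x2 "image" at column 2, row 1.
page = [
    [0, 0, 0, 0, 0],
    [0, 0, 9, 8, 0],
    [0, 0, 7, 6, 0],
    [0, 0, 0, 0, 0],
]
templ = [[9, 8], [7, 6]]
print(match_sqdiff(page, templ))  # (2, 1, 0)
```

On real page renders this brute-force loop would be very slow, which is one reason to use OpenCV's matchTemplate for the actual work.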
Upvotes: 17
Reputation: 111
There's an -xml switch for the pdftohtml command which will give image position, dimension and source information:
pdftohtml -xml file.pdf
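Once the XML exists, the image entries can be pulled out with a few lines of Python. This is a rough sketch that assumes the output contains page elements with a number attribute and image elements carrying top, left, width, height and src attributes with integer values; check this against the XML your pdftohtml version actually produces. The function name image_positions and the sample XML are hypothetical.

```python
import xml.etree.ElementTree as ET


def image_positions(xml_text):
    """Collect page number, position, size and source file of each image."""
    root = ET.fromstring(xml_text)
    found = []
    for page in root.iter("page"):
        for img in page.iter("image"):
            found.append({
                "page": int(page.get("number")),
                "left": int(img.get("left")),
                "top": int(img.get("top")),
                "width": int(img.get("width")),
                "height": int(img.get("height")),
                "src": img.get("src"),
            })
    return found


# Hypothetical sample imitating the assumed structure of the XML output.
SAMPLE_XML = """<pdf2xml>
  <page number="1" position="absolute" top="0" left="0" height="1188" width="918">
    <image top="120" left="81" width="300" height="200" src="file-1_1.png"/>
    <text top="340" left="81" width="120" height="16" font="0">Caption</text>
  </page>
  <page number="2" position="absolute" top="0" left="0" height="1188" width="918">
    <image top="54" left="400" width="150" height="150" src="file-2_1.jpg"/>
  </page>
</pdf2xml>"""

for entry in image_positions(SAMPLE_XML):
    print(entry)
```

For a real file, read the generated XML from disk and pass its contents to image_positions instead of SAMPLE_XML.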
Upvotes: 11
Reputation: 3184
PDF does not guarantee that an image used more than once is stored only once; each use may be embedded as a separate image. There is also very little image metadata in a PDF file beyond the image's location on the page and its actual size there. I wrote an article explaining how images are stored inside a PDF at http://www.jpedal.org/PDFblog/2010/09/understanding-the-pdf-file-format-images/
Upvotes: 6