drunkenfist

Reputation: 3048

PDFBox not detecting image in page

I'm trying to detect images in this PDF using PDFBox. The PDF has two blank images, one on the left side (below the text "Put this IN the box") and the other on the right side (below the text "Affix this OUTSIDE the box"). This is the code I'm using to detect the images:

PDPage page = (PDPage) catalog.getAllPages().get(0);
PDStream contents = page.getContents();
PDFStreamParser parser = new PDFStreamParser(contents.getStream());
parser.parse();
List<Object> tokens = parser.getTokens();

PDResources resources = page.getResources();
Map<String, PDXObjectImage> images = resources.getImages();
if (images != null) {
    Iterator<String> it = images.keySet().iterator();
    while (it.hasNext()) {
        String key = it.next();
        System.out.println("Key >>>>>>>>>>>>>> " + key);
    }
}

I'm able to detect the second image. However, the first image is not being detected. What is the problem? I'm sure the PDF is valid: I recreated it several times and still face the same problem. I created the PDF using Sketch.

Thanks.

Upvotes: 1

Views: 1188

Answers (1)

mkl

Reputation: 96039

In short

I'm able to detect the second image. However, the first image is not being detected. What is the problem?

Actually, the same image resource is used for both on-page images; it is merely stretched to different dimensions.

In detail

If you look at the content stream of your page, you'll see this at the end:

q
720 0 0 970 832 126 cm
/Im1 Do
Q
q
512 0 0 128 144 968 cm
/Im1 Do
Q

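The first four lines draw the image resource Im1 at position 832, 126 stretched to 720 x 970, and the last four lines draw the same image resource Im1 at position 144, 968 stretched to 512 x 128. In each group, the cm operator sets the transformation matrix (scale 720 x 970, offset 832, 126 in the first case), and Do then paints the image into the unit square transformed by that matrix.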
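
To make the reading of those eight lines concrete, here is a minimal sketch in plain Java that walks such a content stream, tracks the current transformation matrix, and reports every Do invocation. It is not PDFBox code: the names ImageUseFinder, ImageUse, find and concat are made up for this illustration, and the tokenizer just splits on whitespace, so it only understands q, Q, cm and Do and would fail on real-world streams containing strings, inline images or form XObjects.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical helper for illustration only; not part of PDFBox.
class ImageUseFinder {

    // One "Do" invocation together with the matrix in effect when it was painted.
    static final class ImageUse {
        final String name;
        final double[] ctm; // {a, b, c, d, e, f}

        ImageUse(String name, double[] ctm) {
            this.name = name;
            this.ctm = ctm;
        }

        // The image's unit square origin lands at (e, f).
        double x() { return ctm[4]; }
        double y() { return ctm[5]; }

        // Lengths of the transformed unit square edges.
        double width() { return Math.hypot(ctm[0], ctm[1]); }
        double height() { return Math.hypot(ctm[2], ctm[3]); }
    }

    // "cm" concatenates: the new matrix m is applied before the current one.
    static double[] concat(double[] m, double[] ctm) {
        return new double[] {
            m[0] * ctm[0] + m[1] * ctm[2],
            m[0] * ctm[1] + m[1] * ctm[3],
            m[2] * ctm[0] + m[3] * ctm[2],
            m[2] * ctm[1] + m[3] * ctm[3],
            m[4] * ctm[0] + m[5] * ctm[2] + ctm[4],
            m[4] * ctm[1] + m[5] * ctm[3] + ctm[5]
        };
    }

    static List<ImageUse> find(String content) {
        List<ImageUse> uses = new ArrayList<>();
        Deque<double[]> saved = new ArrayDeque<>();   // graphics state stack for q / Q
        Deque<String> operands = new ArrayDeque<>();  // operands seen since the last operator
        double[] ctm = {1, 0, 0, 1, 0, 0};            // identity matrix
        for (String token : content.trim().split("\\s+")) {
            switch (token) {
                case "q":
                    saved.push(ctm);
                    break;
                case "Q":
                    if (!saved.isEmpty()) ctm = saved.pop();
                    break;
                case "cm": {
                    double[] m = new double[6];
                    // Operands come off the stack in reverse order: f first, a last.
                    for (int i = 5; i >= 0; i--) m[i] = Double.parseDouble(operands.pop());
                    ctm = concat(m, ctm);
                    break;
                }
                case "Do":
                    // Strip the leading '/' of the resource name.
                    uses.add(new ImageUse(operands.pop().substring(1), ctm));
                    break;
                default:
                    operands.push(token); // not an operator we know: treat as operand
                    continue;
            }
            operands.clear();
        }
        return uses;
    }
}
```

Calling ImageUseFinder.find on the eight lines quoted above should yield two entries, both named Im1: one at 832, 126 with extent 720 x 970 and one at 144, 968 with extent 512 x 128. That is the point of the answer: one resource name, two uses on the page.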

What to do

Merely looking at the page resources is not a reliable way to find on-page images because

  • as you have seen, a single image resource may be used multiple times on a page, stretched to different dimensions,
  • an image resource may not be used at all on a page (e.g. some documents have one big resources dictionary referenced from all pages, so many of its resources may be unused on any given page),
  • images can be inlined into the content stream; your approach would not see these images at all, and
  • form XObjects or patterns displayed on your page may have images in their own resources; as you only look at image resources contained in the immediate page resources, your approach will not find them either.

A better solution (only failing for inlined and probably patterned images) is presented in the PDFBox sample PrintImageLocations, the output of which for your file is:

*******************************************************************
Found image [Im1]
position = 832.0, 128.0
size = 360px, 462px
size = 720.0, 970.0
size = 10.0in, 13.472222in
size = 254.0mm, 342.19446mm

*******************************************************************
Found image [Im1]
position = 144.0, 128.0
size = 360px, 462px
size = 512.0, 128.0
size = 7.111111in, 1.7777778in
size = 180.62222mm, 45.155556mm

This sample makes use of the PDFBox PDFStreamEngine to process the content stream operations that draw a page, so it reports each use of an image rather than each image resource.
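
As a side note on reading that output: the second "size" line of each entry appears to be the drawn extent in user space units, and the "in" and "mm" lines look like the same extent converted using the default PDF user space unit of 1/72 inch. A small sketch of that arithmetic, with the class and method names being hypothetical rather than taken from PDFBox:

```java
// Hypothetical helper for illustration only; not part of PDFBox.
class UserSpaceUnits {
    // Default PDF user space: 1 unit = 1/72 inch.
    static final double UNITS_PER_INCH = 72.0;
    static final double MM_PER_INCH = 25.4;

    static double toInches(double units) {
        return units / UNITS_PER_INCH;
    }

    static double toMillimetres(double units) {
        return toInches(units) * MM_PER_INCH;
    }
}
```

For example, 720 units correspond to 10 inches or 254 mm, which agrees with the first entry of the output.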

Upvotes: 1
