HelloWorld

Reputation: 2355

Why does PDFBox return image dimension of size 0 x 0

To find the actual size an image occupies on a PDF page, I use PDFBox, following the approach described in this SO answer. Basically, I call

// Computes the image actual location and dimensions
PrintImageLocations renderer = new PrintImageLocations();

for (int i = 0; i < pageLimit; ++i) {
    PDPage page = pdf.getPage(i);
    renderer.processPage(page);
}

where PrintImageLocations is taken from this PDFBox code example.

Yet for a PDF document that I use for testing (generated by GPL Ghostscript 910 (ps2write) from an image found on Wikipedia), the reported image size is 0 x 0, although the PDF can be imported into GIMP or LibreOffice Draw.

So I'd like to know whether the code I am currently using is a reliable way to find the image size, and what could prevent it from finding the right size.

The PDF used for this test can be found here

==========

Edit: Following @Itai's comment, it appears that the condition if ("Do".equals(operation)) is never satisfied because no such operation is invoked. Consequently, the processOperator of the superclass is invoked.

The only operations that are invoked are the following (I added System.err.println("Processing " + operation); before the condition in the overridden processOperator method):

Processing q
Processing cm
Processing gs
Processing q
Processing re
Processing W
Processing n
Processing rg
Processing re
Processing f
Processing cs
Processing scn
Processing re
Processing f
Processing Q
Processing Q

==========

Any hints appreciated,

Upvotes: 1

Views: 386

Answers (1)

mkl

Reputation: 95918

As you already have found out yourself, the reason for the 0x0 output is that the code from PrintImageLocations as-is cannot find the image at all.

PrintImageLocations does not find the image because it only looks for image uses in the page content and in form XObjects (also nested ones) used in the page content. In the file at hand, however, the image is drawn inside the content stream of a tiling pattern, which is used to fill an area in the page content.

To allow PDFBox to find this image, we have to extend the PrintImageLocations class a bit to also descend into pattern content streams, e.g. like this:

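// Imports assume the PDFBox 2.0.x package layout (pdfbox and pdfbox-examples artifacts)
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

import org.apache.pdfbox.contentstream.operator.Operator;
import org.apache.pdfbox.contentstream.operator.color.*;
import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.examples.util.PrintImageLocations;
import org.apache.pdfbox.pdmodel.graphics.color.PDColor;
import org.apache.pdfbox.pdmodel.graphics.pattern.PDAbstractPattern;
import org.apache.pdfbox.pdmodel.graphics.pattern.PDTilingPattern;

class PrintImageLocationsImproved extends PrintImageLocations {
    public PrintImageLocationsImproved() throws IOException {
        super();

        // Keep the non-stroking (fill) color in the graphics state up to date
        addOperator(new SetNonStrokingColor());
        addOperator(new SetNonStrokingColorN());
        addOperator(new SetNonStrokingDeviceCMYKColor());
        addOperator(new SetNonStrokingDeviceGrayColor());
        addOperator(new SetNonStrokingDeviceRGBColor());
        addOperator(new SetNonStrokingColorSpace());
    }

    @Override
    protected void processOperator(Operator operator, List<COSBase> operands) throws IOException {
        String operation = operator.getName();
        if (fillOperations.contains(operation)) {
            // If the current fill color is a tiling pattern, analyze its content stream as well
            PDColor color = getGraphicsState().getNonStrokingColor();
            PDAbstractPattern pattern = getResources().getPattern(color.getPatternName());
            if (pattern instanceof PDTilingPattern) {
                processTilingPattern((PDTilingPattern) pattern, null, null);
            }
        }
        super.processOperator(operator, operands);
    }

    // Path-painting operators that fill the current path
    final List<String> fillOperations = Arrays.asList("f", "F", "f*", "b", "b*", "B", "B*");
}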

(ExtractImageLocations inner class PrintImageLocationsImproved)

The tiling pattern in the document at hand is used as a pattern color for filling, not stroking. Thus, PrintImageLocationsImproved has to register operator listeners for non-stroking color operators to have the fill color correctly updated in the graphics state.

Before delegating to the PrintImageLocations implementation, processOperator now first checks whether the operator is a fill operation. If so, it inspects the current fill color. If that is a pattern color, processOperator initiates the processTilingPattern handling defined in PDFStreamEngine, which starts a nested analysis of the pattern content stream and so eventually lets PrintImageLocationsImproved find the image.
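The mechanism can be mimicked without PDFBox. The following is only a toy model under simplifying assumptions: content streams are plain lists of strings, the class ToyContentWalker and the pattern name /P1 are made up for illustration, and q/Q state handling is ignored. It merely shows that a walker following only Do operators misses an image drawn inside a pattern, while a walker that also descends into the fill pattern on fill operators finds it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model of nested content streams; none of this is PDFBox API. */
class ToyContentWalker {
    static final List<String> FILL_OPS = Arrays.asList("f", "F", "f*", "b", "b*", "B", "B*");

    final Set<String> images = new HashSet<>();                   // names of image XObjects
    final Map<String, List<String>> forms = new HashMap<>();      // form XObject name -> content
    final Map<String, List<String>> patterns = new HashMap<>();   // pattern name -> content
    final List<String> found = new ArrayList<>();
    final boolean descendIntoPatterns;
    private String fillPattern; // stand-in for the non-stroking color in the graphics state

    ToyContentWalker(boolean descendIntoPatterns) {
        this.descendIntoPatterns = descendIntoPatterns;
    }

    void walk(List<String> stream) {
        for (String instruction : stream) {
            // An instruction is either "op" or "operand op"
            String[] parts = instruction.split(" ");
            String op = parts[parts.length - 1];
            String operand = parts.length > 1 ? parts[0] : null;

            if ("Do".equals(op)) {
                if (images.contains(operand)) {
                    found.add(operand);
                } else if (forms.containsKey(operand)) {
                    walk(forms.get(operand)); // nested form XObject
                }
            } else if ("scn".equals(op)) {
                // Remember the fill color only if it names a known pattern
                fillPattern = patterns.containsKey(operand) ? operand : null;
            } else if (descendIntoPatterns && FILL_OPS.contains(op) && fillPattern != null) {
                String saved = fillPattern;
                walk(patterns.get(saved)); // nested analysis of the pattern content
                fillPattern = saved;
            }
        }
    }

    /** Page fills a rectangle with pattern /P1, whose content draws image /R8. */
    static List<String> demo(boolean descendIntoPatterns) {
        ToyContentWalker walker = new ToyContentWalker(descendIntoPatterns);
        walker.images.add("/R8");
        walker.patterns.put("/P1", Arrays.asList("q", "/R8 Do", "Q"));
        walker.walk(Arrays.asList("q", "cm", "/P1 scn", "re", "f", "Q"));
        return walker.found;
    }

    public static void main(String[] args) {
        System.out.println("page and forms only: " + demo(false)); // prints []
        System.out.println("also patterns:       " + demo(true));  // prints [/R8]
    }
}
```

As in the page content logged in the question, the toy page stream contains no Do at all, so only the second walker reports the image.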

Using PrintImageLocationsImproved like this

try (   PDDocument document = PDDocument.load(...)    )
{
    PrintImageLocations printer = new PrintImageLocationsImproved();
    int pageNum = 0;
    for( PDPage page : document.getPages() )
    {
        pageNum++;
        System.out.println( "Processing page: " + pageNum );
        printer.processPage(page);
    }
}

(ExtractImageLocations test testExtractLikeHelloWorldImprovedFromTopSecret)

for your PDF file will therefore find the image:

Processing page: 1
*******************************************************************
Found image [R8]
position in PDF = 39.0, 102.48 in user space units
raw image size  = 1209, 1640 in pixels
displayed size  = 516.3119, 700.3752 in user space units
displayed size  = 7.1709986, 9.727433 in inches at 72 dpi rendering
displayed size  = 182.14336, 247.0768 in millimeters at 72 dpi rendering
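The last three lines of this output are plain unit conversions. A small sketch, assuming the default user space unit of 1/72 inch (no UserUnit entry on the page) and using a made-up helper class DisplayedSize, reproduces them and additionally derives the effective resolution of the image, which the output itself does not print:

```java
/** Hypothetical helper for the unit conversions; not part of PDFBox. */
class DisplayedSize {
    static final float USER_UNITS_PER_INCH = 72f; // default user space: 1 unit = 1/72 inch
    static final float MM_PER_INCH = 25.4f;

    static float toInches(float userUnits) {
        return userUnits / USER_UNITS_PER_INCH;
    }

    static float toMillimeters(float userUnits) {
        return toInches(userUnits) * MM_PER_INCH;
    }

    /** Pixels of the raw image per displayed inch. */
    static float effectiveDpi(int pixels, float userUnits) {
        return pixels / toInches(userUnits);
    }

    public static void main(String[] args) {
        float width = 516.3119f;   // displayed width in user space units, from the output above
        float height = 700.3752f;  // displayed height in user space units, from the output above

        System.out.println("inches      = " + toInches(width) + ", " + toInches(height));
        System.out.println("millimeters = " + toMillimeters(width) + ", " + toMillimeters(height));
        System.out.println("dpi         = " + effectiveDpi(1209, width) + ", " + effectiveDpi(1640, height));
    }
}
```

For the values above, both axes come out at roughly 168 to 169 pixels per inch, i.e. the image is scaled uniformly.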

Beware: this is not a perfect fix but rather a proof of concept and workaround, as it neither properly restricts the pattern to the area actually filled nor returns multiple finds for an area large enough to require multiple pattern tiles to fill. Nonetheless, it returns an image match for the file at hand.
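To get a feeling for the second limitation, one can estimate how many tiles a filled rectangle touches. The sketch below is a rough estimate only and rests on several assumptions: the rectangle is given in pattern space and is axis-aligned, the pattern matrix adds no rotation or skew, the tile grid is anchored at the pattern space origin, and the step values (corresponding to the XStep and YStep entries of the tiling pattern dictionary) are positive. The class TileCount and the numbers in main are hypothetical and not taken from the file at hand.

```java
/** Hypothetical estimate of how many pattern tiles a filled rectangle touches. */
class TileCount {
    /** Number of grid cells of the given width that overlap the interval [lo, hi). */
    static long cellsAlongAxis(double lo, double hi, double step) {
        if (step <= 0 || hi <= lo) {
            return 0; // degenerate input: nothing to cover
        }
        return (long) (Math.ceil(hi / step) - Math.floor(lo / step));
    }

    /** Tiles touched by the rectangle (x, y, width, height) for a grid anchored at the origin. */
    static long tilesCovering(double x, double y, double width, double height,
                              double xStep, double yStep) {
        return cellsAlongAxis(x, x + width, xStep) * cellsAlongAxis(y, y + height, yStep);
    }

    public static void main(String[] args) {
        // A 10 x 10 rectangle at the origin with 5 x 5 tiles touches 2 * 2 tiles
        System.out.println(tilesCovering(0, 0, 10, 10, 5, 5)); // prints 4
        // Shifting the rectangle by 2 units makes it straddle a third column
        System.out.println(tilesCovering(2, 0, 10, 5, 5, 5));  // prints 3
    }
}
```

A complete solution would report one image match per touched tile, each offset by the corresponding multiple of the steps and clipped to the filled area.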

Upvotes: 1
