chris

Reputation: 21

Delete an image from a PDF file using PDFBox

I am trying to delete images from a PDF using Java and PDFBox. The images are not inline, and the PDF has no patterns or forms. The PDF file contains 2 images; the PDFDebugger tool shows them under Resources >> XObject as IM3 and IM5. The problem: when I open the output PDF file, the images are still there.

public class DeleteImage {
    public static void removeImages(String pdfFile) throws Exception {
        PDDocument document = PDDocument.load(new File(pdfFile));

        for (PDPage page : document.getPages()) {
            PDResources pdResources = page.getResources();
            pdResources.getXObjectNames().forEach(propertyName -> {
                if(!pdResources.isImageXObject(propertyName)) {
                    return;
                }
                PDXObject o;
                try {
                    o = pdResources.getXObject(propertyName);
                    if (o instanceof PDImageXObject) {
                        System.out.println("propertyName" + propertyName);
                        page.getCOSObject().removeItem(propertyName);
                    }
                } catch (IOException e) {
                    e.printStackTrace();
                }
            });

            for (COSName name :  page.getResources().getPatternNames()) {
                PDAbstractPattern pattern = page.getResources().getPattern(name);
                System.out.println("have pattern");
            }
              
            PDFStreamParser parser = new PDFStreamParser(page);
            parser.parse();
            List<Object> tokens = parser.getTokens();
            System.out.println("original tokens size" + tokens.size());
            List<Object> newTokens = new ArrayList<Object>();

            for(int j=0; j<tokens.size(); j++) {
                Object token = tokens.get( j );
                if( token instanceof Operator ) {
                    Operator op = (Operator)token;

                    System.out.println("operation" + op.getName());
                    //find image - remove it
                    if( op.getName().equals("Do") ) {
                        System.out.println("op equals Do");
                        newTokens.remove(newTokens.size()-1);
                        continue;
                    } else if ("BI".equals(op.getName())) {
                        System.out.println("inline -- op equals BI");
                    } else {
                        System.out.println("op not quals Do");
                    }
                }
                newTokens.add(token);
            }

            PDDocument newDoc = new PDDocument();
            PDPage newPage = newDoc.importPage(page);
            newPage.setResources(page.getResources());

            System.out.println("tokens size" + newTokens.size());
            PDStream newContents = new PDStream(newDoc);
            OutputStream out = newContents.createOutputStream();
            ContentStreamWriter writer = new ContentStreamWriter( out );
            writer.writeTokens( newTokens);
            out.close();
            newPage.setContents( newContents );
        }

        document.save("RemoveImage.pdf");
        document.close();
    }

    public static void remove(String pdfFile) throws Exception {
        PDDocument document = PDDocument.load(new File(pdfFile));
        PDResources resources = null;
        
        for (PDPage page : document.getPages()) {
            resources = page.getResources();

            for (COSName name : resources.getXObjectNames()) {
                PDXObject xobject = resources.getXObject(name);
                
                if (xobject instanceof PDImageXObject) {
                    System.out.println("have image");
                    removeImages(pdfFile);
                }
            }
        }
        document.save("RemoveImage.pdf");
        document.close();
    }
}

Upvotes: 2

Views: 2296

Answers (1)

mkl

Reputation: 95918

If You Call remove...

In remove you

  • load the PDF into document,
  • iterate over the pages of document, and for each page
    • iterate over the XObject resources, and for each XObject
      • check whether it is an image XObject, and if it is
        • call removeImages, which loads the same original file again, processes it, and saves the result as "RemoveImage.pdf".
  • After all that processing you save the unchanged document to "RemoveImage.pdf".

So in that last step you overwrite any changes you may have made in removeImages and end up with your original file in "RemoveImage.pdf"!

If You Call removeImages Directly...

In removeImages you do make some changes, but there are certain issues:

  • Whenever you find an image XObject resource, you attempt to remove it from the page directly

    page.getCOSObject().removeItem(propertyName);

    but the image XObject resource is not a direct child of the page. It is managed by pdResources, so you should remove it from there.

  • You remove all Do instructions from the page content, not only those for image XObjects, so you probably remove more than you wanted.
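
To illustrate the second point, here is a minimal, library-free sketch of a filter that drops a Do instruction only when its operand names an image. It is a model under stated assumptions, not PDFBox code: the class DoFilterSketch, the nested Op type, and the name "Fm1" are hypothetical, and resource names are modeled as plain strings. In your code the tokens would come from PDFStreamParser, and the set of image names could be collected beforehand with pdResources.isImageXObject(name).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class DoFilterSketch {

    /** Hypothetical stand-in for a content stream operator such as "Do" or "cm". */
    public static final class Op {
        final String name;

        public Op(String name) {
            this.name = name;
        }

        @Override
        public String toString() {
            return name;
        }
    }

    /**
     * Copies the token list, dropping a "Do" operator together with its name
     * operand only when that name is in imageNames. Every other token,
     * including "Do" calls for non-image XObjects, is kept unchanged.
     * Assumption: operands are modeled as plain strings, operators as Op.
     */
    public static List<Object> dropImageDo(List<Object> tokens, Set<String> imageNames) {
        List<Object> kept = new ArrayList<>();
        for (Object token : tokens) {
            boolean isDo = token instanceof Op && "Do".equals(((Op) token).name);
            if (isDo && !kept.isEmpty()) {
                // The operand of Do is the token written immediately before it.
                Object operand = kept.get(kept.size() - 1);
                if (operand instanceof String && imageNames.contains(operand)) {
                    kept.remove(kept.size() - 1); // drop the image name operand
                    continue;                     // and skip the Do operator itself
                }
            }
            kept.add(token);
        }
        return kept;
    }
}
```

Called with the tokens IM3 Do Fm1 Do and the image names IM3 and IM5, this keeps only Fm1 Do. If you take this approach, collect the image names before you remove the entries from the resources, since the check needs the resources to still be present.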

Upvotes: 2
