How to Extract Images from a PDF Form with iText

Question

This article (How to extract images from a PDF with iText in the correct order?) explains how to pull images from a regular PDF file. I need to extract an image that a user has entered into a PDF form field.

I use iText 7. I can access the form fields in iText with code like this:

PdfReader reader = new PdfReader(new FileInputStream(new ClassPathResource("myFile.pdf").getFile()));
PdfDocument document = new PdfDocument(reader);
PdfAcroForm acroForm = PdfAcroForm.getAcroForm(document, false);
Map<String, PdfFormField> fields = acroForm.getFormFields();
PdfButtonFormField imageField = null;
PdfDictionary dictionary = null;
for (String fldName : fields.keySet()) {
    PdfFormField field = fields.get(fldName);
    if ("Image1_af_image".equals(fldName)) {
        imageField = (PdfButtonFormField) fields.get("Image1_af_image");
        dictionary = imageField.getPdfObject();
    }
}

where Image1_af_image is the default name of an image field in the form. Is it possible to extract an image stream from the PdfButtonFormField or its associated dictionary object?

Thank you for your very helpful response. I have incorporated your code as follows:

    public void iTextTest3() throws IOException {

        PdfReader reader = new PdfReader(new FileInputStream(new ClassPathResource("templates/TestForm.pdf").getFile()));

        PdfDocument document = new PdfDocument(reader);
        String fieldname = "Image1_af_image";
        PdfAcroForm acroForm = PdfAcroForm.getAcroForm(document, false);

        PdfFormField imagefield = acroForm.getField(fieldname);
        // get the appearance dictionary
        PdfDictionary apDic = imagefield.getWidgets().get(0).getNormalAppearanceObject();
        // get the xobject resources
        PdfDictionary xObjDic = apDic.getAsDictionary(PdfName.Resources).getAsDictionary(PdfName.XObject);
        for (PdfName key : xObjDic.keySet()) {
            System.out.println(key);
            PdfStream s = xObjDic.getAsStream(key);
            // only process images
            if (PdfName.Image.equals(s.getAsName(PdfName.Subtype))) {  //*** code fails here ***
                PdfImageXObject pixo = new PdfImageXObject(s);
                byte[] imgbytes = pixo.getImageBytes();
                String ext = pixo.identifyImageFileExtension();

                // write the image to file
                String fileName = key.toString().substring(1) + "." + ext;
                System.out.println("image fileName: " + fileName);
                // try-with-resources closes the stream even if the write fails
                try (FileOutputStream fos = new FileOutputStream(fileName)) {
                    fos.write(imgbytes);
                }
            }
        }
        document.close();
    }

The code never writes a file because s.getAsName(PdfName.Subtype) returns "Form" rather than "Image". I'm guessing that I need to recurse into the XObject tree, as you suggest in your post, but I am not sure how to do that. I tried xObjDic.getAsDictionary(), but I don't know which PdfName to pass in as the argument.

rhens · Accepted Answer

The visual appearance of a button in PDF can be fully customized with text, graphics and images, so the image data may be stored slightly differently from one PDF document to another. Generally speaking, though, the form field's widget annotation has an appearance stream, and the image data is stored as an XObject in that stream's Resources dictionary.

Creating a test PDF that contains a button with an image:

String fieldname = "Image1_af_image";
PdfAcroForm form = PdfAcroForm.getAcroForm(pdfDoc, true);
PdfButtonFormField imagefield = PdfFormField.createButton(pdfDoc, new Rectangle(100, 100, 50, 50),
        PdfButtonFormField.FF_PUSH_BUTTON);
imagefield.setImage("button.png").setFieldName(fieldname);
form.addField(imagefield);

Getting the image data from a button:

PdfAcroForm acroForm = PdfAcroForm.getAcroForm(pdfDoc, false);
PdfFormField imagefield = acroForm.getField(fieldname);
// get the appearance dictionary
PdfDictionary apDic = imagefield.getWidgets().get(0).getNormalAppearanceObject();
// get the xobject resources
PdfDictionary xObjDic = apDic.getAsDictionary(PdfName.Resources).getAsDictionary(PdfName.XObject);
for (PdfName key : xObjDic.keySet()) {
    System.out.println(key);
    PdfStream s = xObjDic.getAsStream(key);
    // only process images
    if (PdfName.Image.equals(s.getAsName(PdfName.Subtype))) {
        PdfImageXObject pixo = new PdfImageXObject(s);
        byte[] imgbytes = pixo.getImageBytes();
        String ext = pixo.identifyImageFileExtension();
    
        // write the image to file; try-with-resources closes the stream even on error
        try (FileOutputStream fos = new FileOutputStream(key.toString().substring(1) + "." + ext)) {
            fos.write(imgbytes);
        }
    }
}

You can use a PDF object viewer, such as iText RUPS or Adobe Acrobat's built-in "Browse Internal PDF Structure", to inspect the exact structure of your PDF document and find out where the image data is stored.

EDIT:

A more generic way of extracting the image data, in case it's in nested Form XObjects:

PdfAcroForm acroForm = PdfAcroForm.getAcroForm(pdfDoc, false);
PdfFormField imagefield = acroForm.getField(fieldname);
// get the appearance dictionary
PdfDictionary apDic = imagefield.getWidgets().get(0).getNormalAppearanceObject();
// get the xobject resources
PdfDictionary xObjDic = apDic.getAsDictionary(PdfName.Resources).getAsDictionary(PdfName.XObject);
extractImagesFromXObj(xObjDic);

public void extractImagesFromXObj(PdfDictionary xObjDic) throws IOException {
    for (PdfName key : xObjDic.keySet()) {
        System.out.println(key);
        PdfStream s = xObjDic.getAsStream(key);
        PdfName subType = s.getAsName(PdfName.Subtype);
        // only process images
        if (PdfName.Image.equals(subType)) {
            PdfImageXObject pixo = new PdfImageXObject(s);
            byte[] imgbytes = pixo.getImageBytes();
            String ext = pixo.identifyImageFileExtension();

            // write the image to file
            try (FileOutputStream fos = new FileOutputStream(key.toString().substring(1) + "." + ext)) {
                fos.write(imgbytes);
            }
        }
        // process nested XObject dictionaries recursively
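        else if (PdfName.Form.equals(subType)) {
            // Resources and its XObject entry are optional; skip forms without them
            PdfDictionary nestedRes = s.getAsDictionary(PdfName.Resources);
            PdfDictionary nestedXObjDic = nestedRes == null ? null : nestedRes.getAsDictionary(PdfName.XObject);
            if (nestedXObjDic != null) {
                extractImagesFromXObj(nestedXObjDic);
            }
        }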
    }
}
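
One thing to watch with the recursive version: resource names such as /Im0 are only unique within a single Resources dictionary, so two nested Form XObjects can each contain an image under the same key, and the second file would overwrite the first. A minimal sketch of one way around this is shown below. The class ImageFileNames and the method uniqueFileName are hypothetical helpers written in plain Java, not part of iText, and the "_1", "_2" suffix scheme is just an assumption for illustration.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper (not part of iText): builds output file names from
// PDF resource names and avoids collisions between nested XObject dictionaries.
class ImageFileNames {

    // Returns a file name derived from a resource name such as "/Im0".
    // If that name was already handed out, a numeric suffix is appended.
    static String uniqueFileName(Set<String> used, String resourceName, String ext) {
        // PdfName.toString() includes the leading slash; drop it if present
        String base = resourceName.startsWith("/") ? resourceName.substring(1) : resourceName;
        String candidate = base + "." + ext;
        int counter = 1;
        // Set.add returns false when the name is already taken
        while (!used.add(candidate)) {
            candidate = base + "_" + counter++ + "." + ext;
        }
        return candidate;
    }

    public static void main(String[] args) {
        Set<String> used = new HashSet<>();
        System.out.println(uniqueFileName(used, "/Im0", "png")); // Im0.png
        System.out.println(uniqueFileName(used, "/Im0", "png")); // Im0_1.png
    }
}
```

To use it, you could keep one Set<String> for the whole extraction, pass it down through the recursive calls, and call uniqueFileName(used, key.toString(), ext) in place of key.toString().substring(1) + "." + ext.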
