How to extract XML from XFA PDF document in Java using iText 7 (or other)?

Question

Using Java and iText 7, I am trying to extract the XML data from an XFA PDF form in order to parse (and possibly modify) the data, but all I can manage to do is grab some basic generic data that is the same for any XFA file I use.

I know it has to be possible since it is done in the iText RUPS tool but I have been going in circles for days now.

public class Parse {

    private PdfDocument pdf;
    private PdfAcroForm form;
    private XfaForm xfa;
    private Document domDocument;
    private Map data;
    private int numberOfPages;
    private String pdfText;

    public void openPdf(String src, String dest) throws IOException, TransformerException {

        PdfReader reader = new PdfReader(src);
        reader.setUnethicalReading(true);
        pdf = new PdfDocument(reader, new PdfWriter(dest));
        form = PdfAcroForm.getAcroForm(pdf, true);

        data = new HashMap();
        numberOfPages = getNumberOfPdfPages();
        PdfPage currentPage;
        String textFromPage;

        for (int page = 1; page <= numberOfPages; page++) {
            System.out.println("Reading page: " + page + " -----------------");
            currentPage = pdf.getPage(page);
            textFromPage = PdfTextExtractor.getTextFromPage(currentPage);
            data.put(page, textFromPage);
            pdfText += currentPage + ":" + "\n" + textFromPage + "\n";
        }


        xfa = form.getXfaForm();
        domDocument = xfa.getDomDocument();
        Map map = xfa.extractXFANodes(domDocument);

        System.out.println("The template node = " + map.get("template").toString() + "\n");
        System.out.println("Dom document = " + domDocument.toString() + "\n");
        System.out.println("In map form = " + map.toString() + "\n");
        System.out.println("pdfText = " + pdfText + "\n");

        Node node = xfa.getDatasetsNode();
        NodeList list = node.getChildNodes();

        for (int i = 0; i < list.getLength(); i++) {
            System.out.println("Get Child Nodes Output = " + list.item(i) + "\n");
        }

    }
}

This is the generic output I am receiving.

Reading page: 1 -----------------
The template node = [template: null]

Dom document = [#document: null]

In map form = {template=[template: null], form=[form: null], xfdf=[xfdf: null], xmpmeta=[x:xmpmeta: null], datasets=[xfa:datasets: null], config=[config: null], PDFSecurity=[PDFSecurity: null]}

pdfText = nullcom.itextpdf.kernel.pdf.PdfPage@6fa38a:

> Please wait... 
> 
> If this message is not eventually replaced by the proper contents of
> the document, your PDF  viewer may not be able to display this type of
> document.     You can upgrade to the latest version of Adobe Reader
> for Windows®, Mac, or Linux® by  visiting 
> http://www.adobe.com/go/reader_download.     For more assistance with
> Adobe Reader visit  http://www.adobe.com/go/acrreader.     Windows is
> either a registered trademark or a trademark of Microsoft Corporation
> in the United States and/or other countries. Mac is a trademark  of
> Apple Inc., registered in the United States and other countries. Linux
> is the registered trademark of Linus Torvalds in the U.S. and other 
> countries.

Get Child Nodes Output = [xfa:data: null]

Bruno Lowagie · Accepted Answer

You have a file that is a pure XFA file. This means that the only PDF content stored in this file is the "Please wait..." message. That page is shown in PDF viewers that don't know how to render XFA.

It's also the content you get when you extract the content from the page using:

currentPage = pdf.getPage(page);
textFromPage = PdfTextExtractor.getTextFromPage(currentPage);

This is something you shouldn't do when facing a pure XFA file, because all the relevant content lives in the XML stream embedded in the PDF file.

You already have the first part right:

xfa = form.getXfaForm();
domDocument = xfa.getDomDocument();

The XFA stream is to be found in the /AcroForm entry. I know this is awkward, but that's how PDF was designed. That's not our choice, and XFA is deprecated in PDF 2.0, so XFA is dying anyway. The problem will disappear when XFA is finally dead and buried.

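That said, you have an instance of org.w3c.dom.Document and you want to get the XML stored in this object. Values such as [xfa:data: null] in your output are just the node's toString(), which prints the node name and node value (null for elements), not the XML itself. You don't need iText to serialize the document. That's explained, for instance, in Converting a org.w3c.dom.Document in Java to String using Transformer.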

I tested that code on an XFA file using this snippet:

public static void main(String[] args) throws IOException, TransformerException {
    PdfDocument pdf = new PdfDocument(new PdfReader(SRC));
    PdfAcroForm form = PdfAcroForm.getAcroForm(pdf, true);
    XfaForm xfa = form.getXfaForm();
    Document doc = xfa.getDomDocument();
    DOMSource domSource = new DOMSource(doc);
    StringWriter writer = new StringWriter();
    StreamResult result = new StreamResult(writer);
    TransformerFactory tf = TransformerFactory.newInstance();
    Transformer transformer = tf.newTransformer();
    transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
    transformer.setOutputProperty(OutputKeys.INDENT, "yes");
    transformer.transform(domSource, result);
    writer.flush();
    System.out.println(writer.toString());
}

The output to the screen was the XDP XML file with all the XFA information I expected.

Note that I would be careful when replacing the XFA XML file. It's better not to meddle with the XFA structure, but to create an XML file containing nothing but the data created using the appropriate schema, and to fill the form as described in the FAQ: How to fill out a pdf file programmatically? (Dynamic XFA)
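If you only want the data portion rather than the whole XDP, the same Transformer approach works on any DOM node, not just the document. Below is a minimal sketch that uses only the JDK. The class name XfaDataExtract, the helpers toXml and findDataRoot, and the tiny hand-written XDP string are all hypothetical stand-ins for illustration; in your code the Document would come from xfa.getDomDocument(). It also assumes the document was parsed namespace-aware and that the data sits under the usual xfa:data element.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XfaDataExtract {

    // Namespace of the xfa:datasets and xfa:data elements.
    public static final String XFA_DATA_NS = "http://www.xfa.org/schema/xfa-data/1.0/";

    /** Serializes any DOM node (document, element, ...) to an XML string. */
    public static String toXml(Node node) throws Exception {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(node), new StreamResult(out));
        return out.toString();
    }

    /** Returns the first element below xfa:data (the form's own data root), or null. */
    public static Node findDataRoot(Document doc) {
        NodeList data = doc.getElementsByTagNameNS(XFA_DATA_NS, "data");
        if (data.getLength() == 0) {
            return null;
        }
        for (Node n = data.item(0).getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                return n;
            }
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical stand-in for the Document returned by xfa.getDomDocument().
        String xdp = "<xdp:xdp xmlns:xdp=\"http://ns.adobe.com/xdp/\">"
                + "<xfa:datasets xmlns:xfa=\"" + XFA_DATA_NS + "\">"
                + "<xfa:data><form1><name>Jane</name></form1></xfa:data>"
                + "</xfa:datasets></xdp:xdp>";
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true); // needed for getElementsByTagNameNS
        Document doc = dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xdp)));

        Node root = findDataRoot(doc);
        if (root != null) {
            System.out.println(toXml(root));
        }
    }
}
```

The string returned for the data root is the kind of data-only XML you would edit and then feed back into the form, instead of rewriting the template or other XFA packets.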
