stebeg

Reputation: 165

How to create a formatted string from an XML node in Java

I'm trying to create a formatted string from an XML Node. See this example:

<?xml version="1.0" encoding="UTF-8"?>
<root>
    <parent>
        <foo>
            <bar>foo</bar>
        </foo>        
    </parent>
</root>

The Node I want to create a formatted string for is "foo". I expected a result like this:

<foo>
  <bar>foo</bar>
</foo>

But the actual result is:

<foo>
            <bar>foo</bar>
        </foo>

My approach looks like this:

public String toXmlString(Node node) throws TransformerException {
    final Transformer transformer = TransformerFactory.newInstance().newTransformer();
    transformer.setOutputProperty(OutputKeys.METHOD, "xml");
    transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
    transformer.setOutputProperty(OutputKeys.INDENT, "yes");

    final Writer writer = new StringWriter();
    final StreamResult streamResult = new StreamResult(writer);

    transformer.transform(new DOMSource(node), streamResult);
    return writer.toString();
}

What am I doing wrong?

Upvotes: 0

Views: 305

Answers (4)

stebeg

Reputation: 165

Based on kumesana's answer, I've found an acceptable solution:

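import javax.xml.transform.TransformerException;
import org.jdom2.input.DOMBuilder;
import org.jdom2.output.Format;
import org.jdom2.output.XMLOutputter;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

public String toXmlString(Node node) throws TransformerException {
    // Convert the W3C DOM element into a JDOM 2 element.
    final DOMBuilder builder = new DOMBuilder();
    final Element element = (Element) node;
    final org.jdom2.Element jdomElement = builder.build(element);

    // The pretty format re-indents the element, dropping the existing indentation.
    final XMLOutputter xmlOutputter = new XMLOutputter(Format.getPrettyFormat());
    return xmlOutputter.outputString(jdomElement);
}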

Upvotes: 0

Michael Kay

Reputation: 163262

Saxon gives your desired output provided you strip whitespace on input:

    public void testIndentation() {
        try {
            String in = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                    + "<root>\n"
                    + "    <parent>\n"
                    + "        <foo>\n"
                    + "            <bar>foo</bar>\n"
                    + "        </foo>        \n"
                    + "    </parent>\n"
                    + "</root>";
            Processor proc = new Processor(false);
            DocumentBuilder builder = proc.newDocumentBuilder();
            builder.setWhitespaceStrippingPolicy(WhitespaceStrippingPolicy.ALL); //XX
            XdmNode doc = builder.build(new StreamSource(new StringReader(in)));
            StringWriter sw = new StringWriter();
            Serializer serializer = proc.newSerializer(sw);
            serializer.setOutputProperty(Serializer.Property.METHOD, "xml");
            serializer.setOutputProperty(Serializer.Property.INDENT, "yes");
            XdmNode foo = doc.axisIterator(Axis.DESCENDANT, new QName("foo")).next();
            serializer.serializeNode(foo);
            System.err.println(sw);
        } catch (SaxonApiException err) {
            fail();
        }
    }

But if you don't strip whitespace (comment out line XX), you get the ragged output shown in your post. The spec, from XSLT 2.0 onwards, allows the processor to be smarter than this, but Saxon doesn't take advantage of this. One reason is that the serialization is entirely streamed: it's looking at each event (start element, end element, etc) in isolation rather than considering the document as a whole.

Upvotes: 0

kumesana

Reputation: 2490

This will work better with the third-party library JDOM 2, which also makes manipulating DOM documents easier in general.

Its "pretty format" output will indent as expected, removing existing indentation, as long as the text nodes removed/altered were whitespace-only. When one wants to preserve whitespace, one doesn't ask for indented output.

It would look like this:

public String toXmlString(Element element) {
  return new XMLOutputter(Format.getPrettyFormat()).outputString(element);
}

Upvotes: 0

Jim Garrison

Reputation: 86744

It is doing exactly what it's supposed to do. indent="yes" allows the transform to add whitespace to indent elements, but not to remove whitespace, since it cannot know which whitespace in the input is significant.

In the input you provide, the <foo> and </foo> element lines have 8 leading blanks, and the <bar> line has 12.

The reason the <foo> opening tag is not indented is that the preceding whitespace actually belongs to the containing <parent> element and is not present in the subtree you passed to the transform.
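One way to act on this using only the JDK is to remove the whitespace-only text nodes from a copy of the subtree before serializing it. The following is a minimal sketch, not a definitive fix: the class name SubtreePrettyPrinter and the helper stripWhitespaceOnlyText are made up for illustration, and the indent-amount property is specific to the JDK's built-in Xalan-based transformer, so other implementations may ignore it.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class SubtreePrettyPrinter {

    /** Hypothetical helper: recursively removes text nodes that hold only whitespace. */
    static void stripWhitespaceOnlyText(Node node) {
        NodeList children = node.getChildNodes();
        // Walk backwards so removals do not shift the indices still to be visited.
        for (int i = children.getLength() - 1; i >= 0; i--) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE
                    && child.getNodeValue().trim().isEmpty()) {
                node.removeChild(child);
            } else {
                stripWhitespaceOnlyText(child);
            }
        }
    }

    static String toXmlString(Node node) throws TransformerException {
        // Work on a deep copy so the caller's document keeps its whitespace.
        Node copy = node.cloneNode(true);
        stripWhitespaceOnlyText(copy);

        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.METHOD, "xml");
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        // Assumption: implementation-specific property of the JDK's default transformer.
        transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "2");

        StringWriter writer = new StringWriter();
        transformer.transform(new DOMSource(copy), new StreamResult(writer));
        return writer.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root>\n    <parent>\n        <foo>\n"
                + "            <bar>foo</bar>\n        </foo>\n    </parent>\n</root>";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        Node foo = doc.getElementsByTagName("foo").item(0);
        System.out.println(toXmlString(foo));
    }
}
```

Because the whitespace check treats every whitespace-only text node as insignificant, this is only appropriate when the document has no mixed content whose spacing matters.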

Whitespace stripping behavior is covered in detail in the standards (XSLT 1, XSLT 2). In summary:

A whitespace text node is preserved if either of the following apply:

  • The element name of the parent of the text node is in the set of whitespace-preserving element names
  • ...

and

(XSLT 2) The set of whitespace-preserving element names is specified by xsl:strip-space and xsl:preserve-space declarations. Whether an element name is included in the set of whitespace-preserving names is determined by the best match among all the xsl:strip-space or xsl:preserve-space declarations: it is included if and only if there is no match or the best match is an xsl:preserve-space element.

This is stated more simply in the XSLT 1 spec:

Initially, the set of whitespace-preserving element names contains all element names.

Unfortunately, using xsl:strip-space does not produce the results you want. With <xsl:strip-space elements="*"/> (and indent="yes") I get the following output:

<foo><bar>foo</bar>
</foo>

This makes sense: the whitespace is stripped, and then the </foo> tag is made to line up under its opening tag.
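For anyone who wants to try the xsl:strip-space experiment themselves, here is a rough sketch of one way to set it up. The stylesheet is an assumption (a plain identity transform plus the two declarations discussed above), the class name StripSpaceDemo is invented for the example, and the exact line breaks in the output depend on the XSLT processor in use.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class StripSpaceDemo {

    // Assumed stylesheet: identity transform with strip-space and indent="yes".
    static final String STYLESHEET =
            "<xsl:stylesheet version=\"1.0\""
          + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
          + "<xsl:output method=\"xml\" indent=\"yes\" omit-xml-declaration=\"yes\"/>"
          + "<xsl:strip-space elements=\"*\"/>"
          + "<xsl:template match=\"@*|node()\">"
          + "<xsl:copy><xsl:apply-templates select=\"@*|node()\"/></xsl:copy>"
          + "</xsl:template>"
          + "</xsl:stylesheet>";

    static String transform(Node node) throws Exception {
        // Compile the stylesheet and run it against the given subtree.
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        StringWriter writer = new StringWriter();
        transformer.transform(new DOMSource(node), new StreamResult(writer));
        return writer.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root>\n    <parent>\n        <foo>\n"
                + "            <bar>foo</bar>\n        </foo>\n    </parent>\n</root>";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        Node foo = doc.getElementsByTagName("foo").item(0);
        System.out.println(transform(foo));
    }
}
```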

Upvotes: 1
