Sandman

Reputation: 2745

I want to pretty print an org.w3c.dom.Document without a schema

I feel like I'm going mad. I want to pretty print an org.w3c.dom.Document without a schema (in Java). Indentation is not all that I need; I also want useless empty lines and whitespace ignored. Somehow this doesn't happen: every time I parse an XML document from a file or write it back to a file, the DOM document contains text nodes holding only whitespace (\n, spaces, etc.). Isn't there a simple way to get rid of these, without a schema and without transforming the XML myself by iterating over all the nodes and removing the empty text nodes?

Example: my input file looks like this (but with a lot more empty lines :)

<mytag>
       <anothertag>content</anothertag>



</mytag>

I would like my output file to look like this:

<mytag>
  <anothertag>content</anothertag>
</mytag>

Note: I don't have a schema for the XML (so I'm forced to call builder.setValidating(false)), and I don't have the luxury of an internet connection when this code runs.

Thanks!

UPDATE: I found something very close to what I need, and maybe it helps other soldiers fighting against XML documents without schemas:

org.apache.axis.utils.XMLUtils.normalize(document);

Source code here. Calling this after the Document is created and before it's written with a Transformer produces the pretty output with absolutely no schema validation. JB Nizet also gave me a working answer, but I have the feeling some validation is going on behind the scenes of that code, which would make it different from my use case. I'll leave the question open for a few days, though, in case someone has an even better solution.
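For anyone without Axis on the classpath, the same idea (drop the whitespace-only text nodes, then let a Transformer indent) can be approximated with plain JDK APIs. This is only a sketch and not Axis's actual implementation: the class name `WhitespaceStripper` and its methods are made up for illustration, and the `indent-amount` property is an implementation-specific key that the JDK's built-in transformer understands but other transformers may ignore.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library.
class WhitespaceStripper {

    // Parses without validation, so no schema or network access is needed.
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(false);
        return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }

    // Removes every text node that consists only of whitespace.
    static void stripWhitespaceNodes(Document document) throws Exception {
        NodeList blanks = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate("//text()[normalize-space(.) = '']", document, XPathConstants.NODESET);
        for (int i = 0; i < blanks.getLength(); i++) {
            Node blank = blanks.item(i);
            blank.getParentNode().removeChild(blank);
        }
    }

    // Serializes the document with indentation and no XML declaration.
    static String prettyPrint(Document document) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        // Assumption: implementation-specific key; ignored where unsupported.
        transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "2");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(document), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        Document document = parse(
                "<mytag>\n       <anothertag>content</anothertag>\n\n\n\n</mytag>");
        stripWhitespaceNodes(document);
        System.out.println(prettyPrint(document));
    }
}
```

This still walks the tree, but the XPath expression does the walking, so there is no hand-written recursion over the nodes.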

Upvotes: 1

Views: 2503

Answers (1)

JB Nizet

Reputation: 691765

Here's a working example:

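import java.io.IOException;
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSSerializer;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class Xml {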
    private static final String XML =
        "<mytag>\n" +
        "        <anothertag>content</anothertag>\n" +
        "\n" +
        "\n" +
        "\n" +
        "</mytag>";

    public static void main(String[] args) throws ParserConfigurationException, IOException, SAXException, InstantiationException, IllegalAccessException, ClassNotFoundException {
        DocumentBuilderFactory documentBuilderFactory = DocumentBuilderFactory.newInstance();
        documentBuilderFactory.setValidating(false);
        DocumentBuilder documentBuilder = documentBuilderFactory.newDocumentBuilder();
        Document document = documentBuilder.parse(new InputSource(new StringReader(XML)));

        NodeList childNodes = document.getDocumentElement().getChildNodes();
        for (int i = 0; i < childNodes.getLength(); i++) {
            System.out.println(childNodes.item(i));
        }

        final DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
        final DOMImplementationLS impl = (DOMImplementationLS) registry.getDOMImplementation("LS");
        final LSSerializer writer = impl.createLSSerializer();

        writer.getDomConfig().setParameter("xml-declaration", false);
        writer.getDomConfig().setParameter("format-pretty-print", Boolean.TRUE);

        System.out.println(writer.writeToString(document));
    }
}

Output:

[#text: 
        ]
[anothertag: null]
[#text: 



]
<mytag>
    <anothertag>content</anothertag>
</mytag>

So the parser doesn't validate and it preserves the whitespace-only text nodes, yet the output produced by the serializer is what you expect.
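Since the question is about files rather than strings, the same serializer can also write to a stream through an LSOutput. A minimal sketch, assuming a hypothetical helper class `XmlFileWriter` and a temporary file as the target (neither comes from the code above); the exact whitespace in the result may differ between JDK versions:

```java
import java.io.OutputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSOutput;
import org.w3c.dom.ls.LSSerializer;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library.
class XmlFileWriter {

    // Pretty prints the document into the given file as UTF-8.
    static void write(Document document, Path target) throws Exception {
        DOMImplementationLS impl = (DOMImplementationLS)
                DOMImplementationRegistry.newInstance().getDOMImplementation("LS");
        LSSerializer writer = impl.createLSSerializer();
        writer.getDomConfig().setParameter("format-pretty-print", Boolean.TRUE);

        try (OutputStream stream = Files.newOutputStream(target)) {
            LSOutput output = impl.createLSOutput();
            output.setByteStream(stream);
            output.setEncoding("UTF-8");
            writer.write(document, output);
        }
    }

    public static void main(String[] args) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(false);
        Document document = factory.newDocumentBuilder().parse(new InputSource(
                new StringReader("<mytag>\n   <anothertag>content</anothertag>\n\n\n</mytag>")));

        Path target = Files.createTempFile("pretty", ".xml");
        write(document, target);
        System.out.println(new String(Files.readAllBytes(target), StandardCharsets.UTF_8));
    }
}
```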

Upvotes: 4
