Reputation: 2745
I feel I'm going mad. I want to pretty-print an org.w3c.dom.Document without a schema (in Java). Indentation is not all I need; I also want useless empty lines and whitespace ignored. Somehow this doesn't happen: every time I parse XML from a file or write it back to a file, the DOM document contains text nodes that hold only whitespace (\n, spaces, etc.). Isn't there a simple way to get rid of these, without a schema and without transforming the XML myself by iterating over all the nodes and removing the empty text nodes?
Example: my input file looks like this (but with a lot more empty lines):
<mytag>
<anothertag>content</anothertag>
</mytag>
I would like my output file to look like this:
<mytag>
    <anothertag>content</anothertag>
</mytag>
Note: I don't have a schema for the XML (so I'm forced to call builder.setValidating(false)) and I don't have the luxury of an internet connection when this code is run.
Thanks!
UPDATE: I found something very close to what I need, and maybe it helps other soldiers fighting against XML documents without schemas:
org.apache.axis.utils.XMLUtils.normalize(document);
Source code here. Calling this after the Document is created and before it's written with a Transformer produces the pretty output with absolutely no schema validation. JB Nizet also gave me a working answer, but I have the feeling some validation is going on behind the scenes of that code, which would make it different from my use case. I'll leave the question open for a few days, though, in case someone has an even better solution.
Upvotes: 1
Views: 2503
Reputation: 691765
Here's a working example:
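import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSSerializer;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class Xml {
    private static final String XML =
        "<mytag>\n" +
        " <anothertag>content</anothertag>\n" +
        "\n" +
        "\n" +
        "\n" +
        "</mytag>";

    public static void main(String[] args) throws ParserConfigurationException, IOException, SAXException, InstantiationException, IllegalAccessException, ClassNotFoundException {
        // Parse without validation; no schema or DTD is involved.
        DocumentBuilderFactory documentBuilderFactory = DocumentBuilderFactory.newInstance();
        documentBuilderFactory.setValidating(false);
        DocumentBuilder documentBuilder = documentBuilderFactory.newDocumentBuilder();
        Document document = documentBuilder.parse(new InputSource(new StringReader(XML)));

        // Show that the whitespace-only text nodes are still in the DOM.
        NodeList childNodes = document.getDocumentElement().getChildNodes();
        for (int i = 0; i < childNodes.getLength(); i++) {
            System.out.println(childNodes.item(i));
        }

        // Serialize with DOM Level 3 Load/Save and let the serializer do the pretty-printing.
        final DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
        final DOMImplementationLS impl = (DOMImplementationLS) registry.getDOMImplementation("LS");
        final LSSerializer writer = impl.createLSSerializer();
        writer.getDomConfig().setParameter("xml-declaration", Boolean.FALSE);
        writer.getDomConfig().setParameter("format-pretty-print", Boolean.TRUE);
        System.out.println(writer.writeToString(document));
    }
}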
Output:
[#text:
]
[anothertag: null]
[#text:
]
<mytag>
    <anothertag>content</anothertag>
</mytag>
So, the parser doesn't validate, it preserves the text nodes, and the output produced by the serializer is as you expect it.
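If you'd rather have the whitespace-only text nodes gone from the DOM itself before writing it with a Transformer (the route described in the question's update), here is a minimal sketch using only JDK classes. It selects the blank text nodes with an XPath expression and removes them, so there is no hand-written tree walk. The class and method names (WhitespaceStripper, parse, stripWhitespaceNodes, prettyPrint) are made up for illustration, and I'm not claiming this is what the Axis XMLUtils.normalize method does internally. The indent-amount property is specific to the Xalan-based transformer bundled with the JDK; other implementations may ignore it.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, named here only for illustration.
final class WhitespaceStripper {

    // Parse an XML string without validation (no schema, no DTD lookup needed).
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(false);
        return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }

    // Remove every text node that contains only whitespace; returns how many were removed.
    static int stripWhitespaceNodes(Document document) throws Exception {
        NodeList blanks = (NodeList) XPathFactory.newInstance().newXPath()
            .evaluate("//text()[normalize-space(.) = '']", document, XPathConstants.NODESET);
        for (int i = 0; i < blanks.getLength(); i++) {
            Node blank = blanks.item(i);
            blank.getParentNode().removeChild(blank);
        }
        return blanks.getLength();
    }

    // Write the document with indentation enabled and no XML declaration.
    static String prettyPrint(Document document) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        // Implementation-specific (Xalan / JDK built-in); assumed to be ignored elsewhere.
        transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "4");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(document), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        Document document = parse("<mytag>\n <anothertag>content</anothertag>\n\n\n</mytag>");
        System.out.println("removed: " + stripWhitespaceNodes(document));
        System.out.println(prettyPrint(document));
    }
}
```

Note that this also drops whitespace-only text inside elements where it might be significant (mixed content, xml:space="preserve"), so it only fits data-style XML like the example in the question.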
Upvotes: 4