Reputation: 45408
I have some Java (5.0) code that constructs a DOM from various (cached) data sources, then removes certain element nodes that are not required, then serializes the result into an XML string using:
// Serialize DOM back into a string
Writer out = new StringWriter();
Transformer tf = TransformerFactory.newInstance().newTransformer();
tf.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
tf.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
tf.setOutputProperty(OutputKeys.INDENT, "no");
tf.transform(new DOMSource(doc), new StreamResult(out));
return out.toString();
However, since I'm removing several element nodes, I end up with a lot of extra whitespace in the final serialized document.
Is there a simple way to remove/collapse the extraneous whitespace from the DOM before (or while) it's serialized into a String?
Upvotes: 19
Views: 44844
Reputation: 28101
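I did it like this, using an iterative traversal that removes every whitespace-only text node:
import java.util.LinkedList;
import java.util.regex.Pattern;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Matches text that is empty or consists only of whitespace.
private static final Pattern WHITESPACE_PATTERN = Pattern.compile("\\s*", Pattern.DOTALL);

private void removeWhitespace(Document doc) {
    // Work list of child node lists still to be visited (first in, first out).
    LinkedList<NodeList> stack = new LinkedList<NodeList>();
    stack.add(doc.getDocumentElement().getChildNodes());
    while (!stack.isEmpty()) {
        NodeList nodeList = stack.removeFirst();
        // Iterate backwards so that removing a node does not shift unvisited indices.
        for (int i = nodeList.getLength() - 1; i >= 0; --i) {
            Node node = nodeList.item(i);
            if (node.getNodeType() == Node.TEXT_NODE) {
                if (WHITESPACE_PATTERN.matcher(node.getTextContent()).matches()) {
                    node.getParentNode().removeChild(node);
                }
            } else if (node.getNodeType() == Node.ELEMENT_NODE) {
                stack.add(node.getChildNodes());
            }
        }
    }
}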
Upvotes: 0
Reputation: 1
The following code works:
public String getSoapXmlFormatted(String pXml) {
    try {
        if (pXml != null) {
            DocumentBuilderFactory tDbFactory = DocumentBuilderFactory
                    .newInstance();
            DocumentBuilder tDBuilder;
            tDBuilder = tDbFactory.newDocumentBuilder();
            Document tDoc = tDBuilder.parse(new InputSource(
                    new StringReader(pXml)));
            removeWhitespaces(tDoc);
            final DOMImplementationRegistry tRegistry = DOMImplementationRegistry
                    .newInstance();
            final DOMImplementationLS tImpl = (DOMImplementationLS) tRegistry
                    .getDOMImplementation("LS");
            final LSSerializer tWriter = tImpl.createLSSerializer();
            tWriter.getDomConfig().setParameter("format-pretty-print",
                    Boolean.FALSE);
            tWriter.getDomConfig().setParameter(
                    "element-content-whitespace", Boolean.TRUE);
            pXml = tWriter.writeToString(tDoc);
        }
    } catch (RuntimeException | ParserConfigurationException | SAXException
            | IOException | ClassNotFoundException | InstantiationException
            | IllegalAccessException tE) {
        tE.printStackTrace();
    }
    return pXml;
}

public void removeWhitespaces(Node pRootNode) {
    if (pRootNode != null) {
        NodeList tList = pRootNode.getChildNodes();
        if (tList != null && tList.getLength() > 0) {
            ArrayList<Node> tRemoveNodeList = new ArrayList<Node>();
            for (int i = 0; i < tList.getLength(); i++) {
                Node tChildNode = tList.item(i);
                if (tChildNode.getNodeType() == Node.TEXT_NODE) {
                    if (tChildNode.getTextContent() == null
                            || "".equals(tChildNode.getTextContent().trim()))
                        tRemoveNodeList.add(tChildNode);
                } else
                    removeWhitespaces(tChildNode);
            }
            for (Node tRemoveNode : tRemoveNodeList) {
                pRootNode.removeChild(tRemoveNode);
            }
        }
    }
}
Upvotes: 0
Reputation: 8677
Try using the following XSL with the strip-space element to serialize your DOM:
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" omit-xml-declaration="yes"/>
  <xsl:strip-space elements="*"/>
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
http://helpdesk.objects.com.au/java/how-do-i-remove-whitespace-from-an-xml-document
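One way to plug this stylesheet into the serialization step from the question is to hand it to `TransformerFactory.newTransformer(Source)` instead of using the default identity transformer. The following is a minimal sketch, assuming the stylesheet is inlined as a string; the class name `StripSpaceSerializer` and its `parse`/`serialize` helpers are made up for illustration, and the result relies on the XSLT processor applying `xsl:strip-space` to a `DOMSource`.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of any library.
class StripSpaceSerializer {

    // The identity stylesheet with xsl:strip-space, inlined for the sketch.
    private static final String STRIP_SPACE_XSL =
          "<xsl:stylesheet version=\"1.0\""
        + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:output method=\"xml\" omit-xml-declaration=\"yes\"/>"
        + "<xsl:strip-space elements=\"*\"/>"
        + "<xsl:template match=\"@*|node()\">"
        + "<xsl:copy><xsl:apply-templates select=\"@*|node()\"/></xsl:copy>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Serializes the DOM through the stylesheet; the DOM itself is not modified.
    static String serialize(Document doc) {
        try {
            Transformer tf = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STRIP_SPACE_XSL)));
            StringWriter out = new StringWriter();
            tf.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialize document", e);
        }
    }

    // Parses an XML string into a DOM; whitespace text nodes are kept by default.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<a>\n  <b>x</b>\n  <c>y</c>\n</a>");
        System.out.println(serialize(doc));
    }
}
```

Because the stripping happens during the transform, the in-memory DOM keeps its whitespace nodes; only the serialized string is compacted.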
Upvotes: 8
Reputation: 3146
Another possible approach is to remove neighboring whitespace at the same time as you're removing the target nodes:
private void removeNodeAndTrailingWhitespace(Node node) {
    List<Node> exiles = new ArrayList<Node>();
    exiles.add(node);
    for (Node whitespace = node.getNextSibling();
            whitespace != null
                && whitespace.getNodeType() == Node.TEXT_NODE
                && whitespace.getTextContent().matches("\\s*");
            whitespace = whitespace.getNextSibling()) {
        exiles.add(whitespace);
    }
    for (Node exile : exiles) {
        exile.getParentNode().removeChild(exile);
    }
}
This has the benefit of keeping the rest of the existing formatting intact.
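To illustrate the effect on formatting, here is a small sketch that applies the method above to an indented document. The class name `RemoveNodeDemo` and the `removeFirst` helper are assumptions made for the demo, and the helper assumes that at least one element with the given tag name exists.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of any library.
class RemoveNodeDemo {

    // Same logic as the method above, made static for the demo.
    static void removeNodeAndTrailingWhitespace(Node node) {
        List<Node> exiles = new ArrayList<Node>();
        exiles.add(node);
        // Collect the whitespace-only text nodes that directly follow the node.
        for (Node ws = node.getNextSibling();
                ws != null
                    && ws.getNodeType() == Node.TEXT_NODE
                    && ws.getTextContent().matches("\\s*");
                ws = ws.getNextSibling()) {
            exiles.add(ws);
        }
        for (Node exile : exiles) {
            exile.getParentNode().removeChild(exile);
        }
    }

    // Parses xml, removes the first element named tagName, and serializes the result.
    static String removeFirst(String xml, String tagName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            removeNodeAndTrailingWhitespace(doc.getElementsByTagName(tagName).item(0));
            Transformer tf = TransformerFactory.newInstance().newTransformer();
            tf.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            tf.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not process document", e);
        }
    }

    public static void main(String[] args) {
        // <b> and the line break after it disappear; <c> keeps its indentation.
        System.out.println(removeFirst("<a>\n  <b>1</b>\n  <c>2</c>\n</a>", "b"));
    }
}
```

Only the whitespace that follows the removed node is dropped, so removing the last child of an element can still leave the indentation that preceded it.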
Upvotes: 0
Reputation: 5215
The code below deletes comment nodes and text nodes that contain only whitespace. If a text node has other content, its value is trimmed:
public static void clean(Node node)
{
    NodeList childNodes = node.getChildNodes();
    // Iterate backwards so that removals do not shift unvisited indices.
    for (int n = childNodes.getLength() - 1; n >= 0; n--)
    {
        Node child = childNodes.item(n);
        short nodeType = child.getNodeType();
        if (nodeType == Node.ELEMENT_NODE)
            clean(child);
        else if (nodeType == Node.TEXT_NODE)
        {
            String trimmedNodeVal = child.getNodeValue().trim();
            if (trimmedNodeVal.length() == 0)
                node.removeChild(child);
            else
                child.setNodeValue(trimmedNodeVal);
        }
        else if (nodeType == Node.COMMENT_NODE)
            node.removeChild(child);
    }
}
Ref: http://www.sitepoint.com/removing-useless-nodes-from-the-dom/
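Since this method trims text nodes that have content, it also changes mixed content, where the spaces around inline elements are significant. The sketch below shows that side effect on a small input; the class name `CleanDemo` and the `cleanAndSerialize` helper are hypothetical names introduced for the demo.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of any library.
class CleanDemo {

    // Same logic as clean() above: drop comments and whitespace-only text, trim the rest.
    static void clean(Node node) {
        NodeList childNodes = node.getChildNodes();
        for (int n = childNodes.getLength() - 1; n >= 0; n--) {
            Node child = childNodes.item(n);
            short nodeType = child.getNodeType();
            if (nodeType == Node.ELEMENT_NODE) {
                clean(child);
            } else if (nodeType == Node.TEXT_NODE) {
                String trimmed = child.getNodeValue().trim();
                if (trimmed.length() == 0) {
                    node.removeChild(child);
                } else {
                    child.setNodeValue(trimmed);
                }
            } else if (nodeType == Node.COMMENT_NODE) {
                node.removeChild(child);
            }
        }
    }

    // Parses xml, cleans it from the root element down, and serializes the result.
    static String cleanAndSerialize(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            clean(doc.getDocumentElement());
            Transformer tf = TransformerFactory.newInstance().newTransformer();
            tf.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            tf.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not process document", e);
        }
    }

    public static void main(String[] args) {
        // The spaces around <b> are trimmed away along with the comment.
        System.out.println(cleanAndSerialize("<p> Hello <b>big</b> world <!-- note --></p>"));
    }
}
```

For data-oriented XML this is usually harmless, but for document-style XML the words on either side of an inline element end up joined.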
Upvotes: 5
Reputation: 3
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
This makes the transformer indent the serialized XML.
Upvotes: -3
Reputation: 1858
You can find whitespace-only text nodes using XPath, then remove them programmatically like so:
XPathFactory xpathFactory = XPathFactory.newInstance();
// XPath to find text nodes that are empty or contain only whitespace.
XPathExpression xpathExp = xpathFactory.newXPath().compile(
        "//text()[normalize-space(.) = '']");
NodeList emptyTextNodes = (NodeList)
        xpathExp.evaluate(doc, XPathConstants.NODESET);
// Remove each matching text node from the document.
for (int i = 0; i < emptyTextNodes.getLength(); i++) {
    Node emptyTextNode = emptyTextNodes.item(i);
    emptyTextNode.getParentNode().removeChild(emptyTextNode);
}
This approach might be useful if you want more control over node removal than is easily achieved with an XSL template.
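Put together with the serialization code from the question, the XPath approach could look roughly like this. It is a sketch under assumptions: the class name `XPathWhitespaceDemo` and the `compact` helper are invented for the example, and checked exceptions are wrapped in an unchecked one to keep the demo short.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of any library.
class XPathWhitespaceDemo {

    // Removes all whitespace-only text nodes and returns how many were removed.
    static int removeWhitespaceOnlyTextNodes(Document doc) {
        try {
            XPathExpression expr = XPathFactory.newInstance().newXPath()
                .compile("//text()[normalize-space(.) = '']");
            NodeList nodes = (NodeList) expr.evaluate(doc, XPathConstants.NODESET);
            int count = nodes.getLength();
            for (int i = 0; i < count; i++) {
                Node textNode = nodes.item(i);
                textNode.getParentNode().removeChild(textNode);
            }
            return count;
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("Invalid XPath expression", e);
        }
    }

    // Parses xml, strips whitespace-only text nodes, and serializes as in the question.
    static String compact(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            removeWhitespaceOnlyTextNodes(doc);
            Transformer tf = TransformerFactory.newInstance().newTransformer();
            tf.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            tf.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            tf.setOutputProperty(OutputKeys.INDENT, "no");
            StringWriter out = new StringWriter();
            tf.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not process document", e);
        }
    }

    public static void main(String[] args) {
        // Text with real content, such as " keep me ", is left untouched.
        System.out.println(compact("<a>\n  <b>x</b>\n  <c> keep me </c>\n</a>"));
    }
}
```

Narrowing the XPath expression, for example to text nodes under one particular element, is how the extra control over removal would be exercised.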
Upvotes: 40