Reputation: 1815
<node> test
test
test
</node>
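I want my XML parser to read the characters in <node> and collapse duplicated whitespace into a single space.
If the node contains XML encoded characters: tabs (&#x9;), newlines (&#xA;) or whitespaces (&#x20;) - they should be left as they are.
I'm trying the code below, but it preserves duplicated whitespace.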
dbf = DocumentBuilderFactory.newInstance();
dbf.setIgnoringComments( true );
dbf.setNamespaceAware( namespaceAware );
db = dbf.newDocumentBuilder();
doc = db.parse( inputStream );
Is there any way to do what I want?
Thanks!
Upvotes: 6
Views: 2232
Reputation: 108939
The first part, replacing multiple whitespace characters, is relatively easy, though I don't think the parser will do it for you:
InputSource stream = new InputSource(inputStream);
XPath xpath = XPathFactory.newInstance().newXPath();
Document doc = (Document) xpath.evaluate("/", stream, XPathConstants.NODE);
NodeList nodes = (NodeList) xpath.evaluate("//text()", doc,
        XPathConstants.NODESET);
for (int i = 0; i < nodes.getLength(); i++) {
    Text text = (Text) nodes.item(i);
    text.setTextContent(text.getTextContent().replaceAll("\\s{2,}", " "));
}
// check results
TransformerFactory.newInstance()
        .newTransformer()
        .transform(new DOMSource(doc), new StreamResult(System.out));
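For reference, here is a minimal self-contained sketch of the same idea applied to the sample from the question. The class name `CollapseWhitespace` and the helper `collapse` are made up for illustration, and the document is parsed with a plain `DocumentBuilder` rather than through XPath. It also assumes that single newlines and tabs should become spaces too, so it uses `\\s+` instead of `\\s{2,}`.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

class CollapseWhitespace {

    // Hypothetical helper: parses the XML, collapses every run of whitespace
    // in each text node to one space, and returns the root element's text.
    static String collapse(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate("//text()", doc, XPathConstants.NODESET);
        for (int i = 0; i < nodes.getLength(); i++) {
            Text text = (Text) nodes.item(i);
            // \s+ also matches a single newline or tab, unlike \s{2,}
            text.setData(text.getData().replaceAll("\\s+", " "));
        }
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<node> test\ntest\ntest\n</node>";
        System.out.println("[" + collapse(xml) + "]"); // prints "[ test test test ]"
    }
}
```

Note that `\\s{2,}` only matches runs of two or more whitespace characters, so with that pattern the single newlines in the sample would stay in place. Leading and trailing spaces are kept here; call `trim()` on the result if they are not wanted.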
This is the hard part:
If the node contains XML encoded characters: tabs (&#x9;), newlines (&#xA;) or whitespaces (&#x20;) - they should be left.
The parser will always turn "&#x9;" into "\t" - you may need to write your own XML parser.
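A small check makes this visible. The sketch below (class and method names are invented for the example) parses one document with a literal tab and one with the character reference, and compares what the DOM hands back. By the time the application sees the text, the two are indistinguishable.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class CharRefDemo {

    // Hypothetical helper: returns the text content of the root element.
    static String textOf(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String viaReference = textOf("<node>a&#x9;b</node>"); // tab written as a character reference
        String viaLiteral = textOf("<node>a\tb</node>");      // tab written as a literal character

        // Both strings are "a", a tab character, "b"
        System.out.println(viaReference.equals(viaLiteral)); // prints "true"
    }
}
```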
According to the author of Saxon:
I don't think any XML parser will report numeric character references to the application - they will always be expanded. Really, your application shouldn't care about this any more than it cares about how much whitespace there is between attributes.
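If you really do need the distinction, one possible workaround short of writing a parser is to protect the references before parsing. This is only a rough sketch and not something the parser API offers: it swaps the whitespace character references in the raw text for placeholder characters, parses and collapses as usual, then turns the placeholders back into the real characters. The class name, the method names and the private-use placeholders are all assumptions made for this example. It also rewrites references inside comments and CDATA sections, where they are plain text, so it is not safe for arbitrary input.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class ProtectCharRefs {

    // Placeholders from the Unicode private use area. Assumption: the input
    // never contains these code points itself.
    private static final String TAB = "\uE009";
    private static final String NEWLINE = "\uE00A";
    private static final String SPACE = "\uE020";

    // Replace hex or decimal references for tab, newline and space with
    // placeholders so that the parser cannot expand them into whitespace.
    static String protect(String rawXml) {
        return rawXml
                .replaceAll("&#(x0*9|0*9);", TAB)
                .replaceAll("&#(x0*[aA]|0*10);", NEWLINE)
                .replaceAll("&#(x0*20|0*32);", SPACE);
    }

    // Turn the placeholders back into the characters they stand for.
    static String restore(String text) {
        return text.replace(TAB, "\t").replace(NEWLINE, "\n").replace(SPACE, " ");
    }

    // Collapse literal whitespace, but keep whitespace that was written
    // as a character reference in the source.
    static String collapseKeepingReferences(String rawXml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(protect(rawXml))));
        String text = doc.getDocumentElement().getTextContent();
        // \s does not match the private-use placeholders, so they survive
        return restore(text.replaceAll("\\s+", " "));
    }

    public static void main(String[] args) throws Exception {
        String xml = "<node> a  \n b&#x9;c&#x20;&#x20;d </node>";
        String result = collapseKeepingReferences(xml);
        // The literal run between a and b becomes one space; the referenced
        // tab and the two referenced spaces are kept.
        System.out.println(result.equals(" a b\tc  d ")); // prints "true"
    }
}
```

The example reads the whole element text at once to stay short. For a real document you would apply the same collapse and restore steps to each text node.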
Upvotes: 1