Reputation: 11949
Could someone please explain why this is happening? I have reduced my problem to a small program that reproduces it:
String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n" +
"<title text=\"title1\">\n" +
" <comment id=\"comment1\">\n" +
" <data> abcd </data>\n" +
" <data> efgh </data>\n" +
" </comment>\n" +
" <comment id=\"comment2\">\n" +
" <data> ijkl </data>\n" +
" <data> mnop </data>\n" +
" <data> qrst </data>\n" +
" </comment>\n" +
"</title>\n";
DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc = builder.parse(new InputSource(new StringReader(xml)));
System.out.println(doc.getFirstChild().getNodeName());
System.out.println(doc.getFirstChild().getFirstChild().getNodeName());
The corresponding output is:
title
#text
Firstly, why can’t I get the comment node?
Secondly, why does the data node get interpreted as a #text node?
What would be the correct and simple way to get the required nodes? Please also note that the XML file is not fixed, so I want a general solution. Thanks.
EDIT:
I get a similar problem when using XPath; see the code below:
XPathFactory factory = XPathFactory.newInstance();
XPath xpath = factory.newXPath();
XPathExpression expr = xpath.compile("/title/comment/data/text()");
NodeList result = (NodeList) expr.evaluate(msg.document(), XPathConstants.NODESET);
for(int i = 0; i < result.getLength(); i++)
System.out.println(result.item(i).getNodeName() + " : " + result.item(i).getNodeValue());
This gives the output:
#text : abcd
#text : efgh
#text : ijkl
#text : mnop
#text : qrst
Upvotes: 1
Views: 1929
Reputation: 719641
@JB Nizet's explanation of what is happening is correct.
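One possible workaround would be to configure the parser to ignore "ignorable whitespace" by calling setIgnoringElementContentWhitespace(true) on the DocumentBuilderFactory. I understand that this will cause the parser to not generate those unwanted Text nodes for the whitespace between the tags. Note that, according to the Javadoc, this setting relies on the element content model, so the parser has to be in validating mode and the document needs a DTD for it to take effect.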
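As a rough sketch of that workaround: the example below assumes you are able to attach a DTD to the document (here an internal subset written for illustration, not part of the original question) and puts the factory in validating mode, since the whitespace setting depends on the declared content model. The class and method names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class IgnoreWhitespaceDemo {
    // Illustrative document: the internal DTD is an assumption added so the
    // parser knows which elements have element-only content.
    static final String XML =
          "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
        + "<!DOCTYPE title [\n"
        + "  <!ELEMENT title (comment*)>\n"
        + "  <!ATTLIST title text CDATA #IMPLIED>\n"
        + "  <!ELEMENT comment (data*)>\n"
        + "  <!ATTLIST comment id CDATA #IMPLIED>\n"
        + "  <!ELEMENT data (#PCDATA)>\n"
        + "]>\n"
        + "<title text=\"title1\">\n"
        + "    <comment id=\"comment1\">\n"
        + "        <data> abcd </data>\n"
        + "    </comment>\n"
        + "</title>\n";

    // Parses XML and returns the node name of the root element's first child.
    static String firstChildOfRoot(boolean stripWhitespace) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        if (stripWhitespace) {
            // Both settings are needed: the whitespace option relies on validation.
            factory.setValidating(true);
            factory.setIgnoringElementContentWhitespace(true);
        }
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(XML)));
        // Use getDocumentElement(): with a DOCTYPE present, doc.getFirstChild()
        // would be the document type node rather than <title>.
        return doc.getDocumentElement().getFirstChild().getNodeName();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("default parser:      " + firstChildOfRoot(false));
        System.out.println("whitespace ignored:  " + firstChildOfRoot(true));
    }
}
```

If the XML has no DTD and you cannot add one, this setting will most likely have no visible effect, and filtering by node type is the more general route.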
Upvotes: 0
Reputation: 692211
The first child of the title node is a text node containing the \n and the spaces before the <comment> element starts.
To get the comment node, ask its parent for its second child, or for its first element with the tag name "comment". You may also loop through the children and return the first node of type ELEMENT_NODE.
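The loop over the children could look roughly like this. It is a minimal sketch: firstElementChild and parse are hypothetical helper names, and the XML is a shortened version of the document from the question.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class FirstElementChild {
    // Hypothetical helper: returns the first child that is an element,
    // skipping text nodes (such as whitespace) and comments; null if none.
    static Element firstElementChild(Node parent) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                return (Element) n;
            }
        }
        return null;
    }

    // Hypothetical helper: parses an XML string into a DOM document.
    static Document parse(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        String xml = "<title text=\"title1\">\n"
                   + "    <comment id=\"comment1\">\n"
                   + "        <data> abcd </data>\n"
                   + "    </comment>\n"
                   + "</title>\n";
        Element title = parse(xml).getDocumentElement();
        Element comment = firstElementChild(title);
        // Prints the element name followed by its id attribute.
        System.out.println(comment.getNodeName() + " " + comment.getAttribute("id"));
    }
}
```

Because the helper only checks the node type, it does not depend on how the document happens to be indented.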
<data> is an element node containing a text node. The value of the text node is " abcd ".
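To illustrate the element/text distinction, here is a small sketch that collects every <data> element and reads the value of the text inside it. The class and method names are made up for the example, and trimming the surrounding spaces is an assumption about what the asker wants.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DataText {
    // Hypothetical helper: returns the trimmed text of every <data> element,
    // in document order.
    static List<String> dataValues(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        NodeList dataElements = doc.getElementsByTagName("data");
        List<String> values = new ArrayList<>();
        for (int i = 0; i < dataElements.getLength(); i++) {
            Node data = dataElements.item(i);   // element node named "data"
            // getTextContent() concatenates the text nodes below the element;
            // for <data> abcd </data> that is " abcd " before trimming.
            values.add(data.getTextContent().trim());
        }
        return values;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<title text=\"title1\">\n"
                   + "    <comment id=\"comment1\">\n"
                   + "        <data> abcd </data>\n"
                   + "        <data> efgh </data>\n"
                   + "    </comment>\n"
                   + "</title>\n";
        for (String value : dataValues(xml)) {
            System.out.println("data : " + value);
        }
    }
}
```

The same distinction explains the XPath output in the question: an expression ending in text() selects the text nodes, whose node name is always #text, whereas an expression ending in data would select the elements themselves.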
Upvotes: 2