Problems with simple Java DOM Parsing

Question

Could someone please explain why this is happening? I have simplified my problem by creating a small program that reproduces it; the details are below:

String xml = "<title>\n" +
"    <comment id=\"comment1\">\n" +
"        <data> abcd </data>\n" +
"        <data> efgh </data>\n" +
"    </comment>\n" +
"    <comment id=\"comment2\">\n" +
"        <data> ijkl </data>\n" +
"        <data> mnop </data>\n" +
"        <data> qrst </data>\n" +
"    </comment>\n" +
"</title>\n";

DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc = builder.parse(new InputSource(new StringReader(xml)));

System.out.println(doc.getFirstChild().getNodeName());
System.out.println(doc.getFirstChild().getFirstChild().getNodeName());

The corresponding output is:

title
#text

Firstly, why can’t I get the comment node?

Secondly, why does the data node get interpreted as a #text node?

What would be the correct and simple way to get the required nodes? Please also note that the XML file is not fixed; I want a solution that works for arbitrary XML. Thanks.

EDIT:

I get a similar problem when using XPath; see the code below:

XPathFactory factory = XPathFactory.newInstance();
XPath xpath = factory.newXPath();
XPathExpression expr = xpath.compile("/title/comment/data/text()");
NodeList result = (NodeList) expr.evaluate(msg.document(), XPathConstants.NODESET);
for(int i = 0; i < result.getLength(); i++)
    System.out.println(result.item(i).getNodeName() + " : " + result.item(i).getNodeValue());

This gives the output:

#text :  abcd 
#text :  efgh 
#text :  ijkl 
#text :  mnop 
#text :  qrst

JB Nizet · Accepted Answer

The first child of the title node is a text node containing the \n and the four spaces before the <comment> element starts.
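To see that whitespace node directly, here is a minimal sketch. It assumes a shortened version of the XML from the question; the class name `FirstChildDemo` and the `parse` helper are made up for the illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class FirstChildDemo {
    // Shortened stand-in for the XML in the question (an assumption).
    static final String XML =
        "<title>\n" +
        "    <comment id=\"comment1\">\n" +
        "        <data> abcd </data>\n" +
        "    </comment>\n" +
        "</title>\n";

    // Hypothetical helper: parses a string into a DOM Document.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        Node first = parse(XML).getDocumentElement().getFirstChild();
        // The node between <title> and <comment> is plain text.
        System.out.println(first.getNodeName());                   // #text
        // Its value is only the line break and the indentation.
        System.out.println(first.getNodeValue().trim().isEmpty()); // true
    }
}
```

The parser keeps this whitespace because, without a DTD or schema, it cannot tell whether the whitespace is significant.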

To get the comment node, ask its parent for its second child, or for its first element with the tag name "comment". You may also loop through the children and return the first node of type ELEMENT_NODE.
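The three options could look roughly like this. This is a sketch, not the only way to do it: `firstElementChild`, `parse`, and the class name are hypothetical helpers, and the XML is a shortened version of the one in the question.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class FirstElementChildDemo {
    // Shortened stand-in for the XML in the question (an assumption).
    static final String XML =
        "<title>\n" +
        "    <comment id=\"comment1\">\n" +
        "        <data> abcd </data>\n" +
        "    </comment>\n" +
        "    <comment id=\"comment2\">\n" +
        "        <data> ijkl </data>\n" +
        "    </comment>\n" +
        "</title>\n";

    // Hypothetical helper: parses a string into a DOM Document.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    // Hypothetical helper: returns the first child that is an element,
    // skipping text, comment, and other node types; null if there is none.
    static Element firstElementChild(Node parent) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                return (Element) n;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Element root = parse(XML).getDocumentElement();

        // Option 1: second child (index 1), which depends on the exact layout.
        Node bySecondChild = root.getChildNodes().item(1);
        // Option 2: first descendant element named "comment".
        Node byTagName = root.getElementsByTagName("comment").item(0);
        // Option 3: first child of type ELEMENT_NODE, independent of whitespace.
        Element byLoop = firstElementChild(root);

        System.out.println(bySecondChild.getNodeName()); // comment
        System.out.println(byTagName.getNodeName());     // comment
        System.out.println(byLoop.getAttribute("id"));   // comment1
    }
}
```

Since the XML is not fixed, the loop is probably the most robust of the three: it does not assume a particular indentation or a particular tag name.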

<data> is an element node containing a text node. The value of the text node is " abcd ".
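If what you want is the data elements rather than their text children, one possibility is to drop text() from the XPath expression and read each element's text content. The following is a sketch under that assumption; `dataValues`, `parse`, and the class name are made-up helpers, and the XML is shortened from the question.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DataTextDemo {
    // Shortened stand-in for the XML in the question (an assumption).
    static final String XML =
        "<title>\n" +
        "    <comment id=\"comment1\">\n" +
        "        <data> abcd </data>\n" +
        "        <data> efgh </data>\n" +
        "    </comment>\n" +
        "</title>\n";

    // Hypothetical helper: parses a string into a DOM Document.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    // Hypothetical helper: selects the <data> elements themselves and
    // returns the trimmed text content of each one.
    static List<String> dataValues(Document doc) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("/title/comment/data", doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                // Each item is an element named "data"; getTextContent()
                // concatenates the text nodes below it.
                values.add(nodes.item(i).getTextContent().trim());
            }
            return values;
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("invalid XPath expression", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(dataValues(parse(XML))); // [abcd, efgh]
    }
}
```

With "/title/comment/data/text()" the expression explicitly asks for the text nodes, which is why getNodeName() prints #text in the output from the question.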
