Larry

Reputation: 11949

Problems with simple Java DOM Parsing

Could someone please explain why this is happening? I have reduced my problem to a simple program; the details are below:

String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n" +
"<title text=\"title1\">\n" +
"    <comment id=\"comment1\">\n" +
"        <data> abcd </data>\n" +
"        <data> efgh </data>\n" +
"    </comment>\n" +
"    <comment id=\"comment2\">\n" +
"        <data> ijkl </data>\n" +
"        <data> mnop </data>\n" +
"        <data> qrst </data>\n" +
"    </comment>\n" +
"</title>\n";

DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc = builder.parse(new InputSource(new StringReader(xml)));

System.out.println(doc.getFirstChild().getNodeName());
System.out.println(doc.getFirstChild().getFirstChild().getNodeName());

The corresponding output is:

title
#text

Firstly, why can’t I get the comment node?

Secondly, why does the data node get interpreted as a #text node?

What would be the correct and simple way to get the required nodes? Please also note that the XML file is not fixed; I want a general solution. Thanks.

EDIT:

I get a similar problem when using XPath; see the code below:

XPathFactory factory = XPathFactory.newInstance();
XPath xpath = factory.newXPath();
XPathExpression expr = xpath.compile("/title/comment/data/text()");
NodeList result = (NodeList) expr.evaluate(msg.document(), XPathConstants.NODESET);
for(int i = 0; i < result.getLength(); i++)
    System.out.println(result.item(i).getNodeName() + " : " + result.item(i).getNodeValue());

This gives the output:

#text :  abcd 
#text :  efgh 
#text :  ijkl 
#text :  mnop 
#text :  qrst 

Upvotes: 1

Views: 1929

Answers (2)

Stephen C

Reputation: 719641

@JB Nizet's explanation of what is happening is correct.

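One possible workaround would be to configure the parser to ignore "ignorable whitespace" by calling setIgnoringElementContentWhitespace(true) on the DocumentBuilderFactory. I understand that this will cause the parser to not generate those unwanted Text nodes for the whitespace between the tags. Note that, according to the Javadoc, this setting relies on the content model and so requires the parser to be in validating mode, which means the document needs a DTD; without one, the whitespace Text nodes are still generated.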
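A minimal sketch of this workaround, assuming the JDK's built-in parser. The class name IgnoreWhitespace, the helper parseIgnoringWhitespace, and the internal DTD are illustrative additions that are not part of the original question; the DTD is there only so that the parser knows which elements have element-only content.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class IgnoreWhitespace {

    // Hypothetical helper: parses the XML with validation enabled and
    // asks the parser to drop whitespace in element-only content.
    static Document parseIgnoringWhitespace(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true);                        // needed for the next setting to take effect
        factory.setIgnoringElementContentWhitespace(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        // Assumed DTD for the question's document; the original XML has none.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                + "<!DOCTYPE title [\n"
                + "  <!ELEMENT title (comment*)>\n"
                + "  <!ATTLIST title text CDATA #IMPLIED>\n"
                + "  <!ELEMENT comment (data*)>\n"
                + "  <!ATTLIST comment id CDATA #IMPLIED>\n"
                + "  <!ELEMENT data (#PCDATA)>\n"
                + "]>\n"
                + "<title text=\"title1\">\n"
                + "    <comment id=\"comment1\">\n"
                + "        <data> abcd </data>\n"
                + "    </comment>\n"
                + "</title>\n";

        Document doc = parseIgnoringWhitespace(xml);

        // With a DOCTYPE present, doc.getFirstChild() is the DocumentType node,
        // so the root element is fetched with getDocumentElement() instead.
        System.out.println(doc.getDocumentElement().getNodeName());
        System.out.println(doc.getDocumentElement().getFirstChild().getNodeName());
    }
}
```

With the whitespace dropped, the second line printed should be the comment element rather than #text. The text inside data is not affected, because data has mixed (#PCDATA) content.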

Upvotes: 0

JB Nizet

Reputation: 692211

The first child of the title node is a text node containing the \n and the four spaces before the <comment> element starts.

To get the comment node, ask its parent for its second child, or for its first element by tag name "comment". You may also loop through the children and return the first node of type ELEMENT_NODE.

<data> is an element node containing a text node. The value of the text node is " abcd ".
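The approaches above can be sketched as follows. This is only an illustration: the class name FirstElementChild and the helpers firstElementChild and parse are hypothetical names, and the XML is a shortened version of the one in the question.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class FirstElementChild {

    // Hypothetical helper: returns the first child of type ELEMENT_NODE,
    // skipping whitespace text nodes, or null if there is no element child.
    static Element firstElementChild(Node parent) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                return (Element) n;
            }
        }
        return null;
    }

    // Hypothetical helper: parses an XML string the same way the question does.
    static Document parse(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        String xml = "<title text=\"title1\">\n"
                + "    <comment id=\"comment1\">\n"
                + "        <data> abcd </data>\n"
                + "    </comment>\n"
                + "</title>\n";

        Document doc = parse(xml);
        Element title = doc.getDocumentElement();

        // Option 1: loop over the children and take the first element.
        Element comment = firstElementChild(title);
        Element data = firstElementChild(comment);
        System.out.println(comment.getNodeName());          // comment
        System.out.println(data.getNodeName());             // data
        System.out.println("[" + data.getTextContent() + "]"); // [ abcd ]

        // Option 2: look the element up by tag name (searches all descendants).
        Node byTag = title.getElementsByTagName("comment").item(0);
        System.out.println(byTag.getNodeName());            // comment
    }
}
```

Because the loop checks node types rather than positions, it does not depend on how the document happens to be indented, which may suit the case where the XML is not fixed.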

Upvotes: 2
