Personman

Reputation: 2323

Java: Ignoring escapes when parsing XML

I'm using a DocumentBuilder to parse XML files. However, the specification for the project requires that within text nodes, strings like &quot; and &lt; be returned literally, and not decoded as characters (" and <).

A previous similar question, "Read escaped quote as escaped quote from xml", received one answer that seems to be specific to Apache, and another that appears simply not to do what it says it does. I'd love to be proven wrong on either count, however :)

For reference, here is some code:

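  import java.io.File;
  import javax.xml.parsers.DocumentBuilder;
  import javax.xml.parsers.DocumentBuilderFactory;
  import org.w3c.dom.Document;
  import org.w3c.dom.Element;
  import org.w3c.dom.NodeList;

  File file = new File(fileName);
  DocumentBuilderFactory docBderFac = DocumentBuilderFactory.newInstance();
  DocumentBuilder docBder = docBderFac.newDocumentBuilder();
  Document doc = docBder.parse(file);

  // Find the first element with the requested tag name.
  NodeList textElmntLst = doc.getElementsByTagName(text);
  Element textElmnt = (Element) textElmntLst.item(0);

  // Print the value of its first child node (already decoded by the parser).
  NodeList txts = textElmnt.getChildNodes();
  String txt = txts.item(0).getNodeValue();
  System.out.println(txt);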

I would like that println() to produce things like

&quot;3&gt;2&quot;

instead of

"3>2"

which is what currently happens. Thanks!

Upvotes: 4

Views: 4554

Answers (4)

Don Roby

Reputation: 41135

> I'm using a DocumentBuilder to parse XML files. However, the specification for the project requires that within text nodes, strings like &quot; and &lt; be returned literally, and not decoded as characters (" and <).

Bad requirement. Don't do that.

Or at least consider carefully why you think you want or need it.

CDATA sections and escapes are a tactic for passing text like quotes and '<' characters through XML without the parser confusing them with markup. They have no meaning in themselves, and when you pull the text out of the XML, you should accept it as the quotes and '<' characters it was intended to represent.
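
To illustrate the point, here is a minimal sketch using only the JDK's DOM parser. The class name EscapeDemo and the helper rootText are made up for this example. It parses the same text written once with escapes and once as a CDATA section, and both come back as the identical decoded string:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class EscapeDemo {
    // Hypothetical helper: parses a small XML string and returns the
    // text content of its root element, as decoded by the parser.
    public static String rootText(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            // Wrap checked parser exceptions to keep the example short.
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String escaped = rootText("<text>&quot;3&gt;2&quot;</text>");
        String cdata = rootText("<text><![CDATA[\"3>2\"]]></text>");
        System.out.println(escaped);               // prints "3>2"
        System.out.println(cdata);                 // prints "3>2"
        System.out.println(escaped.equals(cdata)); // prints true
    }
}
```

Once parsing is done, the DOM no longer records which of the two spellings the file used, which is why asking the parser for the "literal" form does not really fit the model.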

Upvotes: 2

Personman

Reputation: 2323

Both good answers, but both a little too heavy-weight for this very small-scale application. I ended up going with the total hack of just stripping out all &s (I do this to &s that aren't part of escapes later anyway). It's ugly, but it's working.

Edit: I understand there's all kinds of things wrong with this, and that the requirement is stupid. It's for a school project, all that matters is that it work in one case, and the requirement is not my fault :)

Upvotes: -3

Bozho

Reputation: 597342

You can turn the decoded text back into XML-escaped form with StringEscapeUtils from Apache Commons Lang:

 StringEscapeUtils.escapeXml(str);

(javadoc, commons-lang)
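
If adding a library is too heavy for a small project, the same idea can be sketched with the JDK alone. The class and method names below (XmlTextEscaper, escape) are made up for illustration and are not part of any library. The sketch re-encodes the five predefined XML entities in an already-decoded string:

```java
public class XmlTextEscaper {
    // Hypothetical helper: re-encodes the five predefined XML entities.
    // All other characters pass through unchanged.
    public static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The decoded value from the question's println().
        System.out.println(escape("\"3>2\"")); // prints &quot;3&gt;2&quot;
    }
}
```

Note that this escapes every quote and angle bracket in the text, so it cannot tell whether the source file had a given character escaped or written literally.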

Upvotes: 3

John

Reputation: 6795

One approach might be to try dom4j, and to use the Node.asXML() method. It might return a deep structure, so it might need cloning to get just the node or text you want without any of its children.

Upvotes: 1
