Muhammad Hewedy

Reputation: 30056

Apache commons-lang StringEscapeUtils doesn't escape XML control characters

I need to escape some control characters in XML, such as the ASCII 31 character (0x1F), the 0x0B character, and others.

I tried using StringEscapeUtils from commons-lang, but it doesn't work as expected!

Upvotes: 1

Views: 7138

Answers (3)

Benjamin Muschko

Reputation: 33436

According to the Javadoc, StringEscapeUtils.escapeXml(java.lang.String) only supports the five basic XML entities (gt, lt, quot, amp, apos). In general, control characters are not supported in XML 1.0 in either raw or escaped form; the only exceptions are tab, line feed, and carriage return. See this posting for more information.
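To see this for yourself, here is a minimal sketch that feeds a small document to the JDK's built-in DOM parser and reports whether it is accepted. The class and method names (ControlCharParseCheck, parses) are made up for illustration; only the javax.xml and org.xml.sax calls are real API.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper for illustration only.
final class ControlCharParseCheck {
    private ControlCharParseCheck() {}

    /** Returns true if the parser accepts the document, false if it rejects it. */
    static boolean parses(String xml) {
        try {
            DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // The parser reported the document as not well-formed.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            // Not expected for an in-memory string with default settings.
            throw new IllegalStateException(e);
        }
    }
}
```

With this, parses("<a>&#31;</a>") should come back false, because the character reference names a character that XML 1.0 does not allow, while parses("<a>&#9;</a>") (a tab) should come back true. The default error handler may also print the parser's message to stderr.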

Upvotes: 2

Vineet Reynolds

Reputation: 76709

StringEscapeUtils.escapeXml escapes only the following 5 characters into XML entities:

  • " (the double quote - 0x22, decimal 34)
  • & (the ampersand - 0x26, decimal 38)
  • < (less-than sign - 0x3C, decimal 60)
  • > (greater-than sign - 0x3E, decimal 62)
  • ' (apostrophe - 0x27, decimal 39)

If you need to escape any other characters, especially the ASCII control characters, you'll need to write your own class to do it. After all, HTML does not define character entity references for any of the control characters either. In other words, if you need to convert 0x1F (decimal 31) to &#31;, you'll have to write it yourself.
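A roll-your-own escaper along those lines might look roughly like the sketch below. ControlCharEscaper and escape are hypothetical names, not part of commons-lang, and the choice to leave tab, line feed, and carriage return untouched is an assumption on my part. Keep in mind that references such as &#31; are still rejected by XML 1.0 parsers, so this only helps if the consumer is lenient or reads XML 1.1.

```java
// Hypothetical helper, not part of commons-lang.
final class ControlCharEscaper {
    private ControlCharEscaper() {}

    /**
     * Escapes the five XML special characters as named entities and replaces
     * control characters below 0x20 (except tab, LF, CR) with decimal
     * numeric character references such as &#31;.
     */
    static String escape(String input) {
        StringBuilder out = new StringBuilder(input.length() + 16);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '"':  out.append("&quot;"); break;
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '\'': out.append("&apos;"); break;
                default:
                    if (c < 0x20 && c != '\t' && c != '\n' && c != '\r') {
                        // Assumption: emit a decimal numeric character reference.
                        out.append("&#").append((int) c).append(';');
                    } else {
                        out.append(c);
                    }
            }
        }
        return out.toString();
    }
}
```

For example, passing the three-character string 'a', 0x1F, 'b' to ControlCharEscaper.escape gives a&#31;b.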

Note:

Following Benjamin's point about control characters in the document, it is unlikely that you need to do this at all, especially if the parser that processes these escaped characters will not transform them back into control characters (or will simply throw an exception). You are better off not writing control characters into the XML document you are preparing.
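If you go the route of keeping such characters out of the document, one option is to drop them before writing. The sketch below filters a string down to the characters permitted by the XML 1.0 Char production; XmlCharFilter, isXml10Char, and strip are hypothetical names, and silently dropping characters (rather than replacing them or failing) is an assumption you may want to change.

```java
// Hypothetical helper for illustration only.
final class XmlCharFilter {
    private XmlCharFilter() {}

    /** True if the code point is allowed by the XML 1.0 Char production. */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns the input with every code point that XML 1.0 forbids removed. */
    static String strip(String input) {
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints()
                .filter(XmlCharFilter::isXml10Char)
                .forEach(out::appendCodePoint);
        return out.toString();
    }
}
```

The result still has to go through normal escaping of &, <, and friends before it is written as XML text.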

Upvotes: 2

Truong Nguyen

Reputation: 484

Actually, it is not only the five special characters above that are escaped. StringEscapeUtils.escapeXml also escapes most Unicode characters. The Javadoc for the method says:

Note that unicode characters greater than 0x7f are currently escaped to their numerical \u equivalent. This may change in future releases.

Upvotes: 2
