Reputation: 2089
I have the following XML snippet in a string (note the double-encoded &):
...
<PARA>
S&amp;amp;P
</PARA>
...
My desired output would be:
> ... <PARA> S&amp;P </PARA> ...
If I use:
StringEscapeUtils.unescapeXml()
The actual output is:
> ... <PARA> S&P </PARA> ...
It seems that StringEscapeUtils.unescapeXml() unescapes the input twice, or keeps unescaping for as long as the input contains entities.
Is there a better utility method, or a simple solution, that unescapes every XML entity (not just a few, but all accented characters as well) exactly once, so that my encoded & does not get mangled?
Thanks, Peter
Upvotes: 0
Views: 902
Reputation: 4266
Perhaps a long-winded way of doing it, but I can't use Apache Commons:
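import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnescapeXmlExample {

    public static void main(String[] args) {
        String a = "<PARA> S&amp;P </PARA>";
        String ea = unescapeXML(a);
        System.out.println(ea);
    }

    public static String unescapeXML(final String xml) {
        // Group 1 is "#" for numeric references; group 2 is the name or number.
        // Excluding '&' from group 2 keeps a stray '&' from swallowing a later entity.
        Pattern xmlEntityRegex = Pattern.compile("&(#?)([^;&]+);");
        StringBuffer unescapedOutput = new StringBuffer(xml.length());
        Matcher m = xmlEntityRegex.matcher(xml);
        Map<String, String> builtinEntities = null;
        String entity;
        String hashmark;
        String ent;
        while (m.find()) {
            ent = m.group(2);
            hashmark = m.group(1);
            if ((hashmark != null) && (hashmark.length() > 0)) {
                try {
                    // Numeric reference: decimal (&#38;) or hexadecimal (&#x26;).
                    int code = ent.startsWith("x")
                            ? Integer.parseInt(ent.substring(1), 16)
                            : Integer.parseInt(ent);
                    // toChars also handles code points outside the BMP.
                    entity = new String(Character.toChars(code));
                } catch (IllegalArgumentException e) {
                    // Not a valid number or code point: keep the text as is.
                    entity = m.group();
                }
            } else {
                if (builtinEntities == null) {
                    builtinEntities = buildBuiltinXMLEntityMap();
                }
                entity = builtinEntities.get(ent);
                if (entity == null) {
                    // Unknown entity: keep the text as is.
                    entity = "&" + ent + ';';
                }
            }
            // Quote the replacement so '$' and '\' are not read as group references.
            m.appendReplacement(unescapedOutput, Matcher.quoteReplacement(entity));
        }
        m.appendTail(unescapedOutput);
        return unescapedOutput.toString();
    }

    private static Map<String, String> buildBuiltinXMLEntityMap() {
        Map<String, String> entities = new HashMap<>(10);
        entities.put("lt", "<");
        entities.put("gt", ">");
        entities.put("amp", "&");
        entities.put("apos", "'");
        entities.put("quot", "\"");
        return entities;
    }
}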
Output:
<PARA> S&P </PARA>
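To see why a matcher-based loop of this kind decodes only one level, it may help to feed it a double-encoded input. The following is a reduced sketch rather than the code above: the class name `SinglePassCheck` and the helper `unescapeOnce` are hypothetical, and it handles only the five predefined named entities. Because the matcher scans the original input and never rescans the text it has just written, `&amp;amp;` should come out as `&amp;`, and a second explicit call is needed to reach `&`.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SinglePassCheck {

    // The five entities predefined by XML.
    private static final Map<String, String> PREDEFINED = new HashMap<>();
    static {
        PREDEFINED.put("lt", "<");
        PREDEFINED.put("gt", ">");
        PREDEFINED.put("amp", "&");
        PREDEFINED.put("apos", "'");
        PREDEFINED.put("quot", "\"");
    }

    private static final Pattern NAMED = Pattern.compile("&([a-z]+);");

    // Hypothetical helper: replaces each predefined entity exactly once.
    static String unescapeOnce(String s) {
        Matcher m = NAMED.matcher(s);
        StringBuffer out = new StringBuffer(s.length());
        while (m.find()) {
            String replacement = PREDEFINED.get(m.group(1));
            if (replacement == null) {
                replacement = m.group(); // unknown entity: leave untouched
            }
            // The replacement is appended to 'out' and is never matched again.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String doubleEncoded = "<PARA> S&amp;amp;P </PARA>";
        String once = unescapeOnce(doubleEncoded);
        String twice = unescapeOnce(once);
        System.out.println(once);
        System.out.println(twice);
    }
}
```

Running the same kind of check against any other unescaper is a quick way to tell a single-pass implementation from one that loops until no entities remain.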
Upvotes: 1
Reputation: 10373
When you use third-party libraries, you should include the library name and the version.
StringEscapeUtils is part of both Apache Commons Text and Apache Commons Lang (where it is deprecated). The latest versions (as of November 2017) are Commons Text 1.1 and Commons Lang 3.7. Both versions produce the correct result.
import org.apache.commons.text.StringEscapeUtils;

public class EscapeTest {
    public static void main(String[] args) {
        final String s = "<PARA> S&amp;P </PARA>";
        System.out.println(StringEscapeUtils.unescapeXml(s));
    }
}
Output: <PARA> S&P </PARA>
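One way to cross-check what a "correct" single unescape looks like, without any third-party dependency, is to let the JDK's own XML parser read the snippet, since a conforming parser resolves each entity reference exactly once. This is a different technique from `StringEscapeUtils` and only a minimal sketch: the class `ParserUnescape` and the helper `textOf` are hypothetical names. It assumes the string is a well-formed XML fragment with a single root element, and it returns only the text content, not the markup.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class ParserUnescape {

    // Hypothetical helper: parses the fragment and returns the root element's text.
    static String textOf(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Ask the parser to apply its built-in processing limits.
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            // Malformed input or parser misconfiguration.
            throw new IllegalArgumentException("Could not parse XML fragment", e);
        }
    }

    public static void main(String[] args) {
        // Double-encoded ampersand, as in the question.
        System.out.println(textOf("<PARA> S&amp;amp;P </PARA>"));
    }
}
```

The surrounding whitespace inside the element is preserved in the returned text, so callers may want to `trim()` it.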
Upvotes: 3