Peter Jaloveczki

Reputation: 2089

Escape XML entities only once

I have the following XML snippet in a string (note the double encoded &):

...
&lt;PARA&gt;
S&amp;amp;P
&lt;/PARA&gt;
...

My desired output would be:

> ... <PARA> S&amp;P </PARA> ...

If I use:

StringEscapeUtils.unescapeXml()

The actual output is:

> ... <PARA> S&P </PARA> ...

It seems that StringEscapeUtils.unescapeXml() unescapes the input twice, or keeps unescaping as long as the result still contains entities.

Is there a better utility method, or a simple solution, that unescapes every XML entity (not just a few, but also all accented characters) exactly once, so that my encoded & does not get mangled?

Thanks, Peter

Upvotes: 0

Views: 902

Answers (2)

achAmháin

Reputation: 4266

Perhaps a long-winded way of doing it, but I can't use Apache Commons:

import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnescapeXml {

    public static void main(String[] args) {
        String a = "&lt;PARA&gt; S&amp;amp;P &lt;/PARA&gt;";
        String ea = unescapeXML(a);
        System.out.println(ea);
    }

    public static String unescapeXML(final String xml) {
        // Group 1 is "#" for numeric references; group 2 is the entity name or number.
        Pattern xmlEntityRegex = Pattern.compile("&(#?)([^;]+);");
        StringBuffer unescapedOutput = new StringBuffer(xml.length());

        Matcher m = xmlEntityRegex.matcher(xml);
        Map<String, String> builtinEntities = null;
        String entity;
        String hashmark;
        String ent;
        int code;
        while (m.find()) {
            ent = m.group(2);
            hashmark = m.group(1);
            if ((hashmark != null) && (hashmark.length() > 0)) {
                // Numeric reference: decimal (&#225;) or hexadecimal (&#xE1;).
                if (ent.startsWith("x") || ent.startsWith("X")) {
                    code = Integer.parseInt(ent.substring(1), 16);
                } else {
                    code = Integer.parseInt(ent);
                }
                // toChars also covers code points above U+FFFF.
                entity = new String(Character.toChars(code));
            } else {
                if (builtinEntities == null) {
                    builtinEntities = buildBuiltinXMLEntityMap();
                }
                entity = builtinEntities.get(ent);
                if (entity == null) {
                    // Unknown entity: leave it untouched.
                    entity = "&" + ent + ';';
                }
            }
            // quoteReplacement keeps '$' and '\' in the replacement literal.
            m.appendReplacement(unescapedOutput, Matcher.quoteReplacement(entity));
        }
        m.appendTail(unescapedOutput);
        return unescapedOutput.toString();
    }

    private static Map<String, String> buildBuiltinXMLEntityMap() {
        Map<String, String> entities = new HashMap<>(10);
        entities.put("lt", "<");
        entities.put("gt", ">");
        entities.put("amp", "&");
        entities.put("apos", "'");
        entities.put("quot", "\"");
        return entities;
    }
}

Output:

<PARA> S&amp;P </PARA>
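
For reference, the same single-pass idea can probably be written more compactly on Java 9 or later. The following is only a sketch: the class name SinglePassUnescape and the method name unescapeOnce are made up for this example, and malformed numeric references (out-of-range code points, overly long numbers) are not handled.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
public class SinglePassUnescape {

    // Group 1: decimal reference, group 2: hexadecimal reference, group 3: named entity.
    private static final Pattern REFERENCE =
            Pattern.compile("&(?:#(\\d+)|#[xX]([0-9a-fA-F]+)|(\\w+));");

    // The five entities predefined by XML.
    private static final Map<String, String> PREDEFINED = Map.of(
            "lt", "<", "gt", ">", "amp", "&", "apos", "'", "quot", "\"");

    public static String unescapeOnce(String xml) {
        // Matcher.replaceAll(Function) walks the input once; the text it
        // inserts is not scanned again, so "&amp;amp;" ends up as "&amp;".
        return REFERENCE.matcher(xml).replaceAll(match -> {
            String decoded;
            if (match.group(1) != null) {
                int codePoint = Integer.parseInt(match.group(1));
                decoded = new String(Character.toChars(codePoint));
            } else if (match.group(2) != null) {
                int codePoint = Integer.parseInt(match.group(2), 16);
                decoded = new String(Character.toChars(codePoint));
            } else {
                // Unknown names are left exactly as they were.
                decoded = PREDEFINED.getOrDefault(match.group(3), match.group());
            }
            // The returned string is treated as a replacement pattern,
            // so '$' and '\' have to be quoted.
            return Matcher.quoteReplacement(decoded);
        });
    }

    public static void main(String[] args) {
        // prints: <PARA> S&amp;P </PARA>
        System.out.println(unescapeOnce("&lt;PARA&gt; S&amp;amp;P &lt;/PARA&gt;"));
    }
}
```

Numeric references such as &#225; and &#xE1; both decode to á with this sketch, which covers the accented characters asked about in the question.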

Upvotes: 1

vanje

Reputation: 10373

When you use third-party libraries, you should include the library name and the version.

StringEscapeUtils is part of Apache Commons Text and of Apache Commons Lang (where the class is deprecated in favor of the Commons Text one). The latest versions (as of November 2017) are Commons Text 1.1 and Commons Lang 3.7. Both produce the correct result:

import org.apache.commons.text.StringEscapeUtils;
public class EscapeTest {
  public static void main(String[] args) {
    final String s = "&lt;PARA&gt; S&amp;amp;P &lt;/PARA&gt;";
    System.out.println(StringEscapeUtils.unescapeXml(s));
  }
}

Output: <PARA> S&amp;P </PARA>
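
One way to end up with the output reported in the question is to run an otherwise correct single-pass unescape twice, for example once in a helper and once again further down the pipeline. Whether that is what happened there is only a guess. The sketch below uses plain String.replace as a stand-in for the library call, limited to three entities; the class and method names are invented for this example.

```java
// Hypothetical demo class, not part of Apache Commons.
public class DoubleUnescapeDemo {

    // Stand-in for a single-pass unescape, restricted to &lt;, &gt; and &amp;.
    // &amp; is replaced last so that the '&' it produces cannot be combined
    // with the following text and decoded a second time.
    static String unescapeOnce(String s) {
        return s.replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&amp;", "&");
    }

    public static void main(String[] args) {
        String s = "&lt;PARA&gt; S&amp;amp;P &lt;/PARA&gt;";

        String once = unescapeOnce(s);
        String twice = unescapeOnce(once);

        System.out.println(once);  // <PARA> S&amp;P </PARA>
        System.out.println(twice); // <PARA> S&P </PARA>
    }
}
```

If the second line matches what you see, it may be worth checking whether the string passes through more than one unescape call before it is printed.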

Upvotes: 3
