Reputation: 105133

How to unescape HTML entities but leave XML entities untouched?

This is the input:

<div>The price is &lt; 5 &euro;</div>

This is valid HTML but not valid XML, because the &euro; entity is not declared in the DTD. Valid XML would look like:

<div>The price is &lt; 5 &#8364;</div>

Can you recommend a Java library that can unescape the HTML-specific entities and convert them to something XML accepts, while leaving the XML entities alone?

Upvotes: 1

Answers (2)

Roman

Reputation: 3299

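Using Apache Commons Text (these translator classes originated in Commons Lang 3 and now live in org.apache.commons.text), here is a class that replaces only the HTML-specific entities: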

import org.apache.commons.text.translate.AggregateTranslator;
import org.apache.commons.text.translate.CharSequenceTranslator;
import org.apache.commons.text.translate.EntityArrays;
import org.apache.commons.text.translate.LookupTranslator;
import org.apache.commons.text.translate.NumericEntityUnescaper;


public class HtmlEscapeUtils {

  /**
   * Same as UNESCAPE_HTML4, but without the lookup table for the basic XML entities.
   *
   * @see org.apache.commons.text.StringEscapeUtils#UNESCAPE_HTML4
   */
  public static final CharSequenceTranslator UNESCAPE_HTML_SPECIFIC =
      new AggregateTranslator(
          new LookupTranslator(EntityArrays.ISO8859_1_UNESCAPE),
          new LookupTranslator(EntityArrays.HTML40_EXTENDED_UNESCAPE),
          new NumericEntityUnescaper());


  /**
   * Numeric references such as &#38; are decoded as well.
   *
   * @see org.apache.commons.text.StringEscapeUtils#unescapeHtml4(String)
   * @param input HTML string containing e.g. &quot; &amp; &auml;
   * @return XML string with the HTML 4 entities replaced; the XML entities (e.g. &quot; and &amp;) remain
   */
  public static final String unescapeHtmlToXml(final String input) {
    return UNESCAPE_HTML_SPECIFIC.translate(input);
  }

}

Upvotes: 3

Mike Samuel

Reputation: 120526

The list of all HTML named character references is available at http://www.whatwg.org/specs/web-apps/current-work/multipage/entities.json

If you can tolerate the occasional mistake, you could just go over that file and replace all named character references that are not allowed in stand-alone XML with the corresponding numeric character reference.
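
A minimal sketch of that replacement in Java, using only the standard library. The class name NamedEntityToNumeric and its four-entry table are hypothetical and only for illustration; a real table would be loaded from entities.json. The five XML entities are deliberately left out of the table, so they pass through unchanged.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of any library.
final class NamedEntityToNumeric {

    // Tiny illustrative subset. A complete table would come from entities.json.
    private static final Map<String, Integer> HTML_ENTITIES = Map.of(
            "euro", 0x20AC,
            "nbsp", 0x00A0,
            "copy", 0x00A9,
            "auml", 0x00E4);

    // Matches named references such as &euro;. Numeric references start
    // with '#' after the ampersand, so this pattern never matches them.
    private static final Pattern NAMED_REF =
            Pattern.compile("&([A-Za-z][A-Za-z0-9]*);");

    static String convert(String html) {
        Matcher m = NAMED_REF.matcher(html);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            Integer codePoint = HTML_ENTITIES.get(m.group(1));
            // Names missing from the table (including lt, gt, amp, quot, apos)
            // are copied through untouched.
            String replacement =
                    (codePoint == null) ? m.group() : "&#" + codePoint + ";";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Calling NamedEntityToNumeric.convert on the input from the question turns &euro; into &#8364; and leaves &lt; alone. The sketch assumes one code point per name and a terminating semicolon; some names in entities.json map to two code points, and HTML also accepts a few legacy references without the semicolon, so a full version would need to handle both.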

That simple approach can run into problems though if your input is HTML, not XHTML:

<script>var y=1, lt = 3, x = y&lt; alert(x);</script>

contains a script element whose content is not encoded using entities, so naively replacing the &lt; here will break the script. Other elements such as <xmp> and <style> can have similar problems, as can CDATA sections in foreign XML elements.

If you need a really faithful conversion, or if your HTML is messy, your best bet might be to parse the HTML to a DOM using something like nu.validator and then serialize the DOM as valid XML, using the techniques described in the question "How to pretty print XML from Java?".
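
The serialization half of that approach can be sketched with the JDK's own javax.xml.transform API. The HTML parsing step is not shown: sampleDocument() below is a hypothetical stand-in that builds the DOM by hand, where a real program would obtain the Document from an HTML parser. Setting the output encoding to US-ASCII is an assumption about what you want; with the JDK's built-in serializer it typically causes non-ASCII characters such as the euro sign to be written as numeric character references.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper for illustration; not part of any library.
final class DomToXml {

    // Serializes a DOM as XML. The serializer escapes markup characters
    // in text nodes (for example '<' becomes &lt;).
    static String serialize(Document doc) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.METHOD, "xml");
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            // Assumption: ASCII-only output is wanted, so characters outside
            // ASCII cannot be written literally.
            t.setOutputProperty(OutputKeys.ENCODING, "US-ASCII");
            StringWriter w = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(w));
            return w.toString();
        } catch (Exception e) {
            throw new IllegalStateException("XML serialization failed", e);
        }
    }

    // Stand-in for the HTML parsing step: the DOM an HTML parser would
    // produce for the <div> in the question, with entities already decoded.
    static Document sampleDocument() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element div = doc.createElement("div");
            div.setTextContent("The price is < 5 \u20AC");
            doc.appendChild(div);
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException("Could not create DOM", e);
        }
    }
}
```

Because the parser has already decoded every entity into plain characters, the output can never contain an HTML-only name like &euro;; the serializer only emits what XML allows.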

Even if your input is XHTML, you might need to worry about character sequences that look like entities in CDATA sections. Again, parse and re-render might be your best option.

Upvotes: 1
