Java - replace all non-ASCII but leave HTML special characters

Question

If I have a String

String mine = "Some Name ® plus encoding issue ????? \u0000 something ";

I would like to keep all the ASCII characters and HTML entities, but remove any other encoding.

I tried

mine = mine.replaceAll("[^\\x00-\\x7F]", "");

but this also removes symbols like the trademark and copyright signs.

Is there a way to keep the HTML entities but remove all other encoding?

StoopidDonut · Accepted Answer

You can combine Normalizer and StringEscapeUtils.escapeHtml3 to achieve this with a fair amount of accuracy:

import java.text.Normalizer;
import org.apache.commons.lang3.StringEscapeUtils; // Apache Commons Lang 3

String mine = "site design / logo © 2014 stack exchange inc; árvíztűrő tükörfúrógép";
mine = Normalizer.normalize(mine, Normalizer.Form.NFD); // Normalize with canonical decomposition
mine = StringEscapeUtils.escapeHtml3(mine); // Escape the HTML special characters now
System.out.println(mine); // site design / logo &copy; 2014 stack exchange inc; árvíztűrő tükörfúrógép

mine = mine.replaceAll("[^\\p{ASCII}]", ""); // Drop everything outside ASCII, including the combining marks
mine = StringEscapeUtils.unescapeHtml3(mine); // Unescape
System.out.println(mine); // site design / logo © 2014 stack exchange inc; arvizturo tukorfurogep

Normalizing with canonical decomposition (NFD) splits each accented character into its base letter followed by one or more combining marks; for example, á becomes a followed by U+0301 (combining acute accent).
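To make the decomposition step concrete, here is a minimal sketch that uses only the JDK. The class name NfdDemo and the helper stripAccents are made up for this illustration; the test word is written with Unicode escapes so the source file encoding does not matter.

```java
import java.text.Normalizer;

class NfdDemo {
    /** Decomposes accented characters, then drops everything outside ASCII. */
    static String stripAccents(String input) {
        // After NFD, an accented letter is a base letter plus combining mark(s).
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // The combining marks are non-ASCII, so this leaves only the base letters.
        return decomposed.replaceAll("[^\\p{ASCII}]", "");
    }

    public static void main(String[] args) {
        // "árvíztűrő" spelled with escapes: four of its nine letters carry accents.
        String word = "\u00E1rv\u00EDzt\u0171r\u0151";
        String decomposed = Normalizer.normalize(word, Normalizer.Form.NFD);

        System.out.println(word.length());       // 9
        System.out.println(decomposed.length()); // 13 (one combining mark per accented letter)
        System.out.println(stripAccents(word));  // arvizturo
    }
}
```

Without the normalization step, the same replaceAll would delete the accented letters outright instead of reducing them to their base letters.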

StringEscapeUtils, from Apache Commons Lang, is a handy utility class with escape and unescape methods for HTML, CSV, and XML.

Hence, I first normalize the String with NFD so that the accented characters slip past escapeHtml3 (otherwise each accented character would be replaced by its HTML entity, such as &aacute; for á, and would survive the ASCII filter).

Now when I escape the HTML, the copyright symbol is escaped (to &copy;) while the decomposed accents are left alone. Removing the non-ASCII characters then strips the combining marks, so the accented letters are reduced to their plain base letters, while the copyright symbol stays escaped. unescapeHtml3 then easily reverts it to its original form.
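If pulling in Commons Lang is not an option, the same escape, filter, unescape sequence can be sketched with the JDK alone. This is only an illustration under stated assumptions: the class KeepHtmlSymbols and its small hand-picked TABLE are hypothetical stand-ins for StringEscapeUtils, which covers far more entities.

```java
import java.text.Normalizer;

class KeepHtmlSymbols {
    // Hypothetical, hand-picked table standing in for a real HTML escaper.
    // "&" comes first so that text already containing "&copy;" is not
    // mistaken for an escaped symbol later on.
    private static final String[][] TABLE = {
        {"&", "&amp;"},
        {"\u00A9", "&copy;"},   // copyright sign
        {"\u00AE", "&reg;"},    // registered sign
        {"\u2122", "&trade;"},  // trade mark sign
    };

    static String clean(String input) {
        // 1. Decompose accented letters into base letter + combining marks.
        String s = Normalizer.normalize(input, Normalizer.Form.NFD);
        // 2. Protect the symbols we want to keep by turning them into ASCII entities.
        for (String[] pair : TABLE) {
            s = s.replace(pair[0], pair[1]);
        }
        // 3. Drop everything that is still non-ASCII (combining marks, other symbols).
        s = s.replaceAll("[^\\p{ASCII}]", "");
        // 4. Restore the protected symbols, in reverse order so "&amp;" is undone last.
        for (int i = TABLE.length - 1; i >= 0; i--) {
            s = s.replace(TABLE[i][1], TABLE[i][0]);
        }
        return s;
    }

    public static void main(String[] args) {
        // Copyright sign is kept, accents are flattened, the snowman (U+2603) is removed.
        System.out.println(clean("logo \u00A9 2014; \u00E1rv\u00EDzt\u0171r\u0151 \u2603"));
    }
}
```

Only the symbols listed in the table survive the filter, so the table would need an entry for every character you want to keep.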

You can go through the respective links to gain more perspective on the behavior I have tried to exploit in this case.
