SoluableNonagon

Reputation: 11752

Java - replace all non-ASCII but leave HTML special characters

If I have a String

String mine = "Some Name ® plus encoding issue ????? \u0000 something ";

I would like to keep all the ASCII characters and HTML entities, but remove everything else.

I tried

mine.replaceAll("[^\\x00-\\x7F]", ""); 

but this also removes symbols such as the trademark and copyright signs.

Is there a way to keep the HTML entities but remove all other encoding?

Upvotes: 2

Views: 2601

Answers (2)

StoopidDonut

Reputation: 8617

You can combine Normalizer with StringEscapeUtils.escapeHtml3 to achieve this with a fair amount of accuracy:

import java.text.Normalizer;
import org.apache.commons.lang3.StringEscapeUtils; // Apache Commons Lang 3

String mine = "site design / logo © 2014 stack exchange inc; árvíztűrő tükörfúrógép";
mine = Normalizer.normalize(mine, Normalizer.Form.NFD); // Canonical decomposition: accented letter -> base letter + combining mark
mine = StringEscapeUtils.escapeHtml3(mine); // Escape the HTML entities now
System.out.println(mine); // site design / logo &copy; 2014 stack exchange inc; árvíztűrő tükörfúrógép (accents are now separate combining marks)

mine = mine.replaceAll("[^\\p{ASCII}]", ""); // Strip everything outside ASCII, including the combining marks
mine = StringEscapeUtils.unescapeHtml3(mine); // Unescape: &copy; -> ©
System.out.println(mine); // site design / logo © 2014 stack exchange inc; arvizturo tukorfurogep

Normalizing with canonical decomposition (NFD) splits each accented character into its base letter followed by one or more combining marks.

StringEscapeUtils is a handy utility class from Apache Commons Lang 3 with methods for escaping and unescaping HTML, CSV, and XML.

Hence, I first normalize the String with NFD so that the accented characters slip past escapeHtml3 (otherwise a precomposed character such as á would be escaped to an entity like &aacute; and come back with its accent when unescaped).

When I then escape the HTML, the copyright symbol is escaped to &copy; while the decomposed accents are left alone. Removing the non-ASCII characters strips the combining marks, leaving the plain base letters, but the copyright symbol is still escaped as ASCII text, so unescapeHtml3 restores it to its original form.

You can go through the respective links to gain more perspective on the behavior I have tried to exploit in this case.
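For readers without Commons Lang on the classpath, the same escape, strip, unescape idea can be sketched with the JDK alone. This is only an illustration under stated assumptions: the class name `AsciiWithSymbols`, the `clean` method, and the three-entry `ENTITIES` table are hypothetical stand-ins for `escapeHtml3`/`unescapeHtml3`, which cover many more symbols.

```java
import java.text.Normalizer;
import java.util.LinkedHashMap;
import java.util.Map;

public class AsciiWithSymbols {

    // Hypothetical, deliberately tiny entity table (not the full HTML 3 set).
    private static final Map<String, String> ENTITIES = new LinkedHashMap<>();
    static {
        ENTITIES.put("\u00A9", "&copy;");  // copyright sign
        ENTITIES.put("\u00AE", "&reg;");   // registered sign
        ENTITIES.put("\u2122", "&trade;"); // trade mark sign
    }

    static String clean(String input) {
        // 1. Decompose accented letters into base letter + combining marks.
        String s = Normalizer.normalize(input, Normalizer.Form.NFD);
        // 2. Protect the symbols to keep by rewriting them as ASCII entities.
        for (Map.Entry<String, String> e : ENTITIES.entrySet()) {
            s = s.replace(e.getKey(), e.getValue());
        }
        // 3. Drop everything outside ASCII (combining marks, other symbols).
        s = s.replaceAll("\\P{ASCII}+", "");
        // 4. Restore the protected symbols.
        for (Map.Entry<String, String> e : ENTITIES.entrySet()) {
            s = s.replace(e.getValue(), e.getKey());
        }
        return s;
    }

    public static void main(String[] args) {
        // Input is "logo (c) 2014; arvizturo" written with the real symbol and accents.
        System.out.println(clean("logo \u00A9 2014; \u00E1rv\u00EDzt\u0171r\u0151"));
    }
}
```

One caveat of this simplified version: because it does not escape `&`, any literal `&copy;` text already present in the input would also be turned into the symbol in step 4.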

Upvotes: 2

anubhava

Reputation: 786091

You can use the \\p{ASCII} property:

mine = mine.replaceAll("[^\\p{ASCII}]+", "");

Or use its negation, \\P{ASCII}:

mine = mine.replaceAll("\\P{ASCII}+", "");
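A small check of the two patterns above, as a sketch (the class and method names are hypothetical). It suggests the two forms behave identically, and that both still remove symbols such as ®, since those lie outside ASCII, while an ASCII control character like \u0000 is kept:

```java
public class AsciiPatternDemo {

    // Negated character class around the ASCII property.
    static String stripWithNegatedClass(String s) {
        return s.replaceAll("[^\\p{ASCII}]+", "");
    }

    // Upper-case P is the complement of the ASCII property.
    static String stripWithComplement(String s) {
        return s.replaceAll("\\P{ASCII}+", "");
    }

    public static void main(String[] args) {
        String mine = "Some Name \u00AE plus \u0000 something";
        String a = stripWithNegatedClass(mine);
        String b = stripWithComplement(mine);
        System.out.println(a.equals(b));         // both patterns give the same result
        System.out.println(a.contains("\u00AE")); // the registered sign is removed too
    }
}
```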

Upvotes: 3
