Reputation: 11752
If I have a String
String mine = "Some Name ® plus encoding issue ????? \u0000 something ";
I would like to keep all the ASCII characters and HTML entities, but remove any other encoding.
I tried
mine = mine.replaceAll("[^\\x00-\\x7F]", "");
but this also removes symbols such as the trademark and copyright signs.
Is there a way to keep the HTML entities but remove all other encoding?
Upvotes: 2
Views: 2601
Reputation: 8617
You can combine Normalizer and StringEscapeUtils.escapeHtml3 to achieve this with a fair amount of accuracy:
import java.text.Normalizer;
import org.apache.commons.lang3.StringEscapeUtils; // Apache Commons Lang 3

String mine = "site design / logo © 2014 stack exchange inc; árvíztűrő tükörfúrógép";
mine = Normalizer.normalize(mine, Normalizer.Form.NFD); // Normalize with canonical decomposition
mine = StringEscapeUtils.escapeHtml3(mine); // Escape the HTML values now
System.out.println(mine); // site design / logo &copy; 2014 stack exchange inc; árvíztűrő tükörfúrógép (accents now decomposed)
mine = mine.replaceAll("[^\\p{ASCII}]", ""); // Drop everything that is not ASCII, including the combining marks
mine = StringEscapeUtils.unescapeHtml3(mine); // Unescape
System.out.println(mine); // site design / logo © 2014 stack exchange inc; arvizturo tukorfurogep
Normalizing with canonical decomposition (NFD) splits each accented character into its base letter followed by one or more combining marks. (The link is an excellent resource on that.)
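To make the decomposition step concrete, here is a minimal JDK-only sketch. The class and method names (NfdDemo, decompose, stripAccents) are made up for illustration and do not come from the answer above.

```java
import java.text.Normalizer;

class NfdDemo {
    // Canonical decomposition: an accented letter becomes base letter + combining mark(s).
    static String decompose(String input) {
        return Normalizer.normalize(input, Normalizer.Form.NFD);
    }

    // After decomposition, the combining marks are non-ASCII and can be stripped on their own.
    static String stripAccents(String input) {
        return decompose(input).replaceAll("[^\\p{ASCII}]", "");
    }

    public static void main(String[] args) {
        // U+00E1 (a with acute) decomposes into 'a' followed by U+0301.
        System.out.println(decompose("\u00e1").length());                    // 2
        System.out.println(stripAccents("\u00e1rv\u00edzt\u0171r\u0151"));   // arvizturo
    }
}
```

The input is written with Unicode escapes only so that the sketch does not depend on the source file encoding.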
StringEscapeUtils is a handy utility class with escape/unescape methods for HTML, CSV and XML.
Hence, I first normalize the String with NFD so that the accented characters slip past escapeHtml3 (otherwise each accented character that has an entity would be replaced by that HTML entity, such as &aacute;, and would survive the ASCII filter). When I then escape the HTML, the copyright symbol is escaped without affecting the accents. Removing the non-ASCII characters strips the combining marks and leaves the plain base letters, while the copyright symbol is still escaped, so unescapeHtml3 can easily revert it to its original form.
You can go through the respective links to learn more about the behavior I have tried to exploit in this case.
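If Apache Commons Lang is not on the classpath, the same escape, strip, unescape idea can be sketched with the JDK alone. This is only a rough stand-in under stated assumptions: the KeepSymbolsDemo class and its three-entry entity table are hypothetical, and StringEscapeUtils covers far more entities than this.

```java
import java.text.Normalizer;
import java.util.LinkedHashMap;
import java.util.Map;

class KeepSymbolsDemo {
    // Hypothetical, deliberately tiny entity table (an assumption for this sketch).
    private static final Map<String, String> ENTITIES = new LinkedHashMap<>();
    static {
        ENTITIES.put("\u00a9", "&copy;");   // copyright sign
        ENTITIES.put("\u00ae", "&reg;");    // registered sign
        ENTITIES.put("\u2122", "&trade;");  // trade mark sign
    }

    static String clean(String input) {
        // 1. Decompose accented letters into base letter + combining marks.
        String s = Normalizer.normalize(input, Normalizer.Form.NFD);
        // 2. Protect literal ampersands, then turn the symbols to keep into ASCII entities.
        s = s.replace("&", "&amp;");
        for (Map.Entry<String, String> e : ENTITIES.entrySet()) {
            s = s.replace(e.getKey(), e.getValue());
        }
        // 3. Drop everything that is still non-ASCII (combining marks, other symbols).
        s = s.replaceAll("[^\\p{ASCII}]", "");
        // 4. Restore the protected symbols, and the ampersands last.
        for (Map.Entry<String, String> e : ENTITIES.entrySet()) {
            s = s.replace(e.getValue(), e.getKey());
        }
        return s.replace("&amp;", "&");
    }

    public static void main(String[] args) {
        System.out.println(clean("logo \u00a9 2014 \u00e1rv\u00edzt\u0171r\u0151"));
    }
}
```

Escaping the ampersand first means that text which already contains a literal entity such as &copy; passes through unchanged instead of being turned into a symbol.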
Upvotes: 2
Reputation: 786091
You can use the \\p{ASCII} property:
mine = mine.replaceAll("[^\\p{ASCII}]+", "");
Or else use \\P{ASCII}:
mine = mine.replaceAll("\\P{ASCII}+", "");
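As a quick check that the two patterns behave the same, here is a small sketch. The AsciiFilterDemo class and its method names are made up for illustration. Note that on their own both patterns also drop symbols such as the copyright sign.

```java
class AsciiFilterDemo {
    // Negated character class around the ASCII property.
    static String withNegatedClass(String s) {
        return s.replaceAll("[^\\p{ASCII}]+", "");
    }

    // Upper-case P is the complement of the same property.
    static String withComplementProperty(String s) {
        return s.replaceAll("\\P{ASCII}+", "");
    }

    public static void main(String[] args) {
        String input = "caf\u00e9 \u00a9 2014";
        // Both calls remove the accented e and the copyright sign.
        System.out.println(withNegatedClass(input).equals(withComplementProperty(input))); // true
    }
}
```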
Upvotes: 3