Vladimir

Reputation: 13173

How do I convert special characters using Java?

I have strings like:

Avery® Laser &amp; Inkjet Self-Adhesive

I need to convert them to

Avery Laser & Inkjet Self-Adhesive.

That is, remove special characters and convert HTML special chars to regular ones.

Upvotes: 9

Views: 41743

Answers (4)

BalusC

Reputation: 1109635

Avery® Laser &amp; Inkjet Self-Adhesive

First use StringEscapeUtils#unescapeHtml4() (or #unescapeXml(), depending on the original format) to unescape the &amp; into a &. Then use String#replaceAll() with [^\x20-\x7e] to get rid of characters which aren't inside the printable ASCII range.

Summarized:

String clean = StringEscapeUtils.unescapeHtml4(dirty).replaceAll("[^\\x20-\\x7e]", "");

...which produces

Avery Laser & Inkjet Self-Adhesive

(without the trailing dot as in your example, but that wasn't present in the original ;) )
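
As a minimal sketch of the second step on its own, using only the JDK: the class and method names below are hypothetical, not part of any library. The character class keeps code points 0x20 through 0x7E and drops everything else, so it also removes accented letters, tabs, and line breaks, which may or may not be what you want.

```java
final class AsciiCleaner {
    // Hypothetical helper: removes every character outside printable ASCII (0x20-0x7E).
    static String stripNonPrintableAscii(String s) {
        return s.replaceAll("[^\\x20-\\x7e]", "");
    }

    public static void main(String[] args) {
        // \u00AE is the registered sign; it falls outside the kept range and is dropped.
        System.out.println(stripNonPrintableAscii("Avery\u00AE Laser & Inkjet Self-Adhesive"));
        // \u00E9 (e with acute accent) is dropped as well, leaving "Caf".
        System.out.println(stripNonPrintableAscii("Caf\u00E9"));
    }
}
```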

That said, this looks more like a request for a workaround than a request for a solution. If you elaborate on the functional requirement and/or where this string originated, we may be able to provide the right solution. The ® in particular looks like it was caused by reading the string in with the wrong encoding, and the &amp; looks like it was caused by reading the string in with a text-based parser instead of a full-fledged HTML parser.
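
To illustrate the wrong-encoding explanation, here is a small sketch of one common way it happens. It assumes the text was written as UTF-8 and read back as ISO-8859-1; the actual encodings involved in the question are not known, and the class and method names are hypothetical. Under that assumption, a single ® turns into two characters, and the damage can be reversed as long as no bytes were lost.

```java
import java.nio.charset.StandardCharsets;

final class MojibakeDemo {
    // Simulates the assumed bug: UTF-8 bytes decoded as ISO-8859-1.
    static String misdecode(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Reverses the damage by re-encoding as ISO-8859-1 and decoding as UTF-8.
    static String repair(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "Avery\u00AE";          // \u00AE is the registered sign
        String garbled = misdecode(original);      // now contains \u00C2 followed by \u00AE
        System.out.println(garbled.length());      // one character longer than the original
        System.out.println(repair(garbled).equals(original));
    }
}
```

Reading the input with the correct charset in the first place avoids the need for any cleanup of this kind.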

Upvotes: 20

Romain Linsolas

Reputation: 81667

You can use the StringEscapeUtils class from the Apache Commons Text project.

Upvotes: 6

Bala Dutt

Reputation: 55

In case you want to mimic what the PHP function htmlspecialchars_decode does, use the PHP function get_html_translation_table() to dump the table, and then use Java code like:

    // Needs: import java.util.LinkedHashMap; import java.util.Map;
    // LinkedHashMap keeps insertion order; "&amp;" is decoded last so that
    // "&amp;lt;" becomes "&lt;" rather than "<".
    static final Map<String, String> html_specialchars_table = new LinkedHashMap<>();
    static {
            html_specialchars_table.put("&lt;", "<");
            html_specialchars_table.put("&gt;", ">");
            html_specialchars_table.put("&amp;", "&");
    }
    static String htmlspecialchars_decode_ENT_NOQUOTES(String s) {
            for (Map.Entry<String, String> entry : html_specialchars_table.entrySet()) {
                    // replace() matches literally; replaceAll() would treat the key as a regex
                    s = s.replace(entry.getKey(), entry.getValue());
            }
            return s;
    }
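
Looping over the table with one replacement per key makes the result depend on the order in which the keys are visited: if &amp; happens to be replaced before &lt;, an input such as &amp;lt; gets decoded twice. A possible alternative, sketched here with hypothetical names and assuming Java 9 or later, scans the string once so that already-decoded text is never looked at again.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SinglePassDecoder {
    // Same three entities as the table above; extend both the map and the pattern together.
    private static final Map<String, String> TABLE =
            Map.of("&lt;", "<", "&gt;", ">", "&amp;", "&");
    private static final Pattern ENTITY = Pattern.compile("&(?:lt|gt|amp);");

    static String decode(String s) {
        Matcher m = ENTITY.matcher(s);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            // quoteReplacement keeps '$' and '\' in the replacement from being interpreted.
            m.appendReplacement(out, Matcher.quoteReplacement(TABLE.get(m.group())));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("Laser &amp; Inkjet"));
        // The "&amp;" is decoded once; the resulting "&lt;" is left alone.
        System.out.println(decode("&amp;lt;b&gt;"));
    }
}
```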

Upvotes: 1

oropher

Reputation: 29

Maybe you can use something like:

yourTxt = yourTxt.replaceAll("&amp;", "&");

In one project I did something like:

public String replaceAcutesHTML(String str) {
    str = str.replaceAll("&aacute;","á");
    str = str.replaceAll("&eacute;","é");
    str = str.replaceAll("&iacute;","í");
    str = str.replaceAll("&oacute;","ó");
    str = str.replaceAll("&uacute;","ú");
    str = str.replaceAll("&Aacute;","Á");
    str = str.replaceAll("&Eacute;","É");
    str = str.replaceAll("&Iacute;","Í");
    str = str.replaceAll("&Oacute;","Ó");
    str = str.replaceAll("&Uacute;","Ú");
    str = str.replaceAll("&ntilde;","ñ");
    str = str.replaceAll("&Ntilde;","Ñ");
    return str;
}

Upvotes: 1
