Sunny Sachdeva

Reputation: 289

Convert non-English string to normal String in Java

I need to validate certain text against some baselines.

For example:

String a="La PanthÃ¨re";
String b="La Panth&egrave;re";

I know that string b contains HTML entities, so I am using Apache StringEscapeUtils, which gives me

String b="La Panth&egrave;re";
b=StringEscapeUtils.unescapeHtml(b);

Output:- La Panthère

However, I do not know what is stored in string a. I read somewhere on SO that these might be accented characters, so I tried the code below:

a=Normalizer.normalize(a, Normalizer.Form.NFKD);

Note: I tried all forms of Normalizer, but none of them worked.

Can someone please help me convert String a into the same form as String b?

Upvotes: 1

Views: 1104

Answers (1)

Mena

Reputation: 48404

As Jesper mentions, the Ã¨ pattern typically indicates a mis-encoding: the two UTF-8 bytes of è were each read as a separate character.

At that point, you're already out of luck.

Remedial actions such as replacing the Ã¨ are neither advisable nor safe.

Escaping or normalizing the String is out of scope, as your problem is at the source and has nothing to do with HTML conversion or accent normalization.

However, there are simple idioms to convert the String to a different encoding.

The example below:

  • simulates the corruption by decoding the UTF-8 bytes of the text as Windows-1252.
  • then, it prints the result as is (corrupted, since UTF-8 bytes were interpreted as Windows-1252).
  • finally, it prints it re-converted to UTF-8.

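    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    // Simulate the corruption: UTF-8 bytes wrongly decoded as Windows-1252.
    String a = new String(
        "La Panthère".getBytes(StandardCharsets.UTF_8),
        Charset.forName("Cp1252")
    );
    // Print the corrupted text.
    System.out.println(a);
    // Reverse the mistake: recover the original bytes, then decode them as UTF-8.
    System.out.println(
        new String(
            a.getBytes(Charset.forName("Cp1252")),
            StandardCharsets.UTF_8
        )
    );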
    

Output

La PanthÃ¨re
La Panthère

Notes

The conversion idiom described above implies you know how the original String is encoded beforehand.
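If you only suspect the encoding, one option is to run the conversion strictly, so that it fails instead of silently producing more garbage. The following is a minimal sketch, not a definitive fix: the class `MojibakeRepair` and the method `repair` are hypothetical names, and the code assumes the text was UTF-8 that got decoded as Windows-1252. Unicode escapes are used for the non-ASCII characters so the example does not depend on the encoding of the source file.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class MojibakeRepair {

    // Hypothetical helper. Assumption: the input was UTF-8 text that was
    // wrongly decoded as Windows-1252. Returns the input unchanged when
    // the strict round trip fails.
    public static String repair(String suspect) {
        try {
            // Strict encoder: fail on characters Windows-1252 cannot represent.
            ByteBuffer bytes = Charset.forName("Cp1252").newEncoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .encode(CharBuffer.wrap(suspect));
            // Strict decoder: fail if the recovered bytes are not valid UTF-8.
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(bytes)
                    .toString();
        } catch (CharacterCodingException e) {
            // The assumption does not hold for this input, so leave it alone.
            return suspect;
        }
    }

    public static void main(String[] args) {
        String corrupted = "La Panth\u00c3\u00a8re"; // A-tilde + diaeresis instead of e-grave
        String clean = "La Panth\u00e8re";           // correctly decoded text
        System.out.println(repair(corrupted).equals(clean)); // true
        System.out.println(repair(clean).equals(clean));     // true
    }
}
```

A correctly decoded string is left untouched here because its Windows-1252 bytes do not form valid UTF-8. The check is still a heuristic: text that legitimately contains the sequence Ã¨ would be rewritten too, so fixing the encoding at the source remains the safer route.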

Typical encoding issues take place when text written in one of the following encodings is interpreted as another:

  • ISO Latin 1
  • Windows-1252
  • UTF-8
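To see why these three clash, it can help to look at the bytes each one produces for the same character. This is a small illustrative sketch; the class `EncodingBytes` and the method `hex` are names made up for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingBytes {

    // Hypothetical helper: renders the encoded bytes of a string as hex pairs.
    public static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            sb.append(String.format("%02X ", b));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String eGrave = "\u00e8"; // the character e-grave
        System.out.println(hex(eGrave, StandardCharsets.ISO_8859_1)); // E8
        System.out.println(hex(eGrave, Charset.forName("Cp1252")));   // E8
        System.out.println(hex(eGrave, StandardCharsets.UTF_8));      // C3 A8
    }
}
```

The single-byte encodings agree on one byte, while UTF-8 uses two. Reading those two bytes with a single-byte encoding yields two separate characters, which is the corruption seen in the question.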

Here's a list of Java-supported encodings along with their canonical names.

In a web context, you'd typically invoke JavaScript's encodeURIComponent function to encode your values in the front-end, before sending them to the back-end.
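On the Java side, a value encoded that way can be decoded with `java.net.URLDecoder`, as long as UTF-8 is named explicitly. A minimal sketch, assuming the back-end receives the raw percent-encoded value (many web frameworks already decode request parameters for you); the class `ComponentDecoding` and the method `decode` are hypothetical names.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class ComponentDecoding {

    // Hypothetical helper: decodes a percent-encoded value as UTF-8,
    // which is the encoding encodeURIComponent always uses.
    public static String decode(String encoded) {
        try {
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // What encodeURIComponent produces for the text in the question.
        String fromFrontEnd = "La%20Panth%C3%A8re";
        System.out.println(decode(fromFrontEnd).equals("La Panth\u00e8re")); // true
    }
}
```

Note that `URLDecoder` also turns a literal `+` into a space. That does not affect values produced by encodeURIComponent, which emits `%2B` for a plus sign.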

Upvotes: 2
