Reputation: 289
I need to validate certain text against some baselines.
For example:
String a="La PanthÃ¨re";
String b="La Panth&egrave;re";
I know that string b contains HTML entities, so I am using Apache Commons StringEscapeUtils, which gives me:
String b="La Panth&egrave;re";
b=StringEscapeUtils.unescapeHtml(b);
Output:- La Panthère
However, I do not know what is stored in string a. I read somewhere on SO that these might be accent literals, so I tried the code below:
a=Normalizer.normalize(a, Normalizer.Form.NFKD);
Note: I tried all forms of Normalizer but nothing worked.
Can someone please help me convert string a into the same form as string b?
Upvotes: 1
Views: 1104
Reputation: 48404
As Jesper mentions, the Ã¨ pattern typically indicates a mis-encoding.
At that point, you're already out of luck.
Remedial actions such as replacing the Ã¨ by hand are neither advisable nor safe.
Escaping or normalizing the String
is out of scope, as your problem is at the source and has nothing to do with HTML conversion or accent normalization.
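To illustrate why normalization cannot help, here is a minimal sketch (the class name NormalizerDemo is made up for this example). It assumes string a holds the two separate characters Ã (U+00C3) and ¨ (U+00A8), written as Unicode escapes so the source file encoding does not matter. Normalization only recomposes or decomposes those two characters; it never merges them into è.

```java
import java.text.Normalizer;

final class NormalizerDemo {
    // Hypothetical helper: true if any normalization form yields the intended text.
    static boolean anyFormRepairs(String misEncoded, String intended) {
        for (Normalizer.Form form : Normalizer.Form.values()) {
            if (Normalizer.normalize(misEncoded, form).equals(intended)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String a = "La Panth\u00C3\u00A8re";      // the mis-encoded text "La PanthÃ¨re"
        String intended = "La Panth\u00E8re";     // "La Panthère"
        // NFC leaves the string untouched; the other forms only decompose Ã and ¨.
        System.out.println(Normalizer.normalize(a, Normalizer.Form.NFC).equals(a));
        System.out.println(anyFormRepairs(a, intended));
    }
}
```

The first line printed is true and the second is false: no normalization form turns the two wrong characters back into è.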
However, there are simple idioms to convert the String
to a different encoding.
The example below:
Builds a mis-encoded String by decoding the UTF-8 bytes of the text as Cp1252 (in a UTF-8 environment).
Prints that mis-encoded String (in a UTF-8 print stream).
Finally, prints it re-converted to UTF-8.
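// Simulate the mis-encoding: take the UTF-8 bytes and decode them as Cp1252.
String a = new String(
    "La Panthère".getBytes(Charset.forName("UTF-8")),
    Charset.forName("Cp1252")
);
// Prints the mis-encoded text.
System.out.println(a);
// Reverse it: re-encode as Cp1252 to recover the original bytes, then decode them as UTF-8.
System.out.println(
    new String(
        a.getBytes(Charset.forName("Cp1252")),
        Charset.forName("UTF-8")
    )
);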
Output
La PanthÃ¨re
La Panthère
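The same idiom can be wrapped in a small helper. This is only a sketch: the class and method names (MojibakeRepair, repair) are invented for illustration, and it hard-codes the assumption that the text was UTF-8 wrongly decoded as windows-1252 (Cp1252). The string literals use Unicode escapes so the result does not depend on the source file encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class MojibakeRepair {
    // Assumption: the bytes were UTF-8 but were decoded as windows-1252 (alias Cp1252).
    private static final Charset WRONG = Charset.forName("windows-1252");

    /** Reverses the wrong decoding: recover the raw bytes, then decode them as UTF-8. */
    static String repair(String misEncoded) {
        byte[] originalBytes = misEncoded.getBytes(WRONG);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String broken = "La Panth\u00C3\u00A8re";   // the mis-encoded text "La PanthÃ¨re"
        String fixed = repair(broken);
        System.out.println(fixed.equals("La Panth\u00E8re")); // true
    }
}
```

Note that repair silently substitutes replacement characters when the assumption is wrong, so it should only be applied to text known to be mis-encoded in this particular way.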
Notes
The conversion idiom described above implies you know how the original String
is encoded beforehand.
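When the encoding is only a guess, one option is to run the re-conversion strictly, so that a wrong guess is reported instead of silently producing replacement characters. The sketch below is an assumption-laden illustration, not a detection algorithm: StrictReconversion and tryReconvert are hypothetical names, and a successful result only shows that the guess is consistent with the text, not that it is correct.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.util.Optional;

final class StrictReconversion {
    /**
     * Re-encodes text with the charset it was (assumed to be) wrongly decoded with,
     * then decodes the bytes with the charset assumed to be the real one.
     * Returns empty if either step hits an unmappable or malformed sequence.
     */
    static Optional<String> tryReconvert(String text, Charset assumedWrong, Charset assumedRight) {
        try {
            ByteBuffer bytes = assumedWrong.newEncoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .encode(CharBuffer.wrap(text));
            CharBuffer chars = assumedRight.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(bytes);
            return Optional.of(chars.toString());
        } catch (CharacterCodingException e) {
            return Optional.empty(); // the guess does not fit this text
        }
    }
}
```

With Cp1252 as the wrong charset and UTF-8 as the right one, the mis-encoded text is recovered, while text that was already correct ("La Panthère") yields an empty result, because a lone 0xE8 byte followed by an ASCII letter is not valid UTF-8. That makes the helper safer to apply to input of mixed quality.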
Typical encoding issues arise when text written in one encoding is interpreted as another, as with UTF-8 and Cp1252 in the example above.
Here's a list of Java-supported encodings along with their canonical names.
In a web context, you'd typically call JavaScript's encodeURIComponent function to encode your values in the front end before sending them to the back end.
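On the Java side, a value produced by encodeURIComponent (which percent-encodes the UTF-8 bytes) can be decoded with the standard URLDecoder, as long as UTF-8 is named explicitly. This is a minimal sketch; the class and method names are made up, and many web frameworks already perform this decoding for request parameters.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

final class UriComponentDemo {
    /** Decodes a percent-encoded value, assuming the front end encoded it as UTF-8. */
    static String decodeComponent(String encoded) {
        try {
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // encodeURIComponent("La Panthère") yields "La%20Panth%C3%A8re" in the browser.
        String decoded = decodeComponent("La%20Panth%C3%A8re");
        System.out.println(decoded.equals("La Panth\u00E8re")); // true
    }
}
```

URLDecoder also turns a literal + into a space. That does not affect values from encodeURIComponent, which encodes + as %2B, but it matters for strings from other sources.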
Upvotes: 2