Reputation: 4103
One of my feed providers sends XML containing the string below, which includes Unicode characters that I am unable to parse. I also tried taking the hex code of each character and prefixing it with \u, but that did not work either.
String str = "🎉🎉🎉🎉</fullText" + ">";
StringBuilder strb = new StringBuilder();
char[] chars = str.toCharArray();
for (int i = 0; i < chars.length; i++) {
    char c = chars[i];
    if ( c >= Character.MIN_HIGH_SURROGATE && c <= Character.MAX_HIGH_SURROGATE ) {
        char ch2 = chars[i+1];
    } else
        strb.append(c);
}
System.out.println(strb.toString());
This should have skipped those characters, but it does not. I want to remove them from the string.
Has anyone faced a similar issue? Any help in this regard is highly appreciated.
Vaibhav
Upvotes: 1
Views: 1970
Reputation: 201828
The code skips only the high surrogate of each pair; the low surrogate that follows is still appended. The simplest way to make it skip the low surrogate too is to change the line
char ch2 = chars[i+1];
to
i++;
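For reference, here is a minimal sketch of the question's loop with that one-line change applied. The class and method names (`SkipSurrogatePairs`, `skipPairs`) are hypothetical wrappers added only to make the snippet runnable, and the sketch assumes every high surrogate is immediately followed by its low surrogate.

```java
final class SkipSurrogatePairs {
    // Hypothetical helper: same loop as in the question, with the one-line fix.
    // Assumes well-formed input, i.e. each high surrogate is followed by a low one.
    static String skipPairs(String str) {
        StringBuilder strb = new StringBuilder();
        char[] chars = str.toCharArray();
        for (int i = 0; i < chars.length; i++) {
            char c = chars[i];
            if (c >= Character.MIN_HIGH_SURROGATE && c <= Character.MAX_HIGH_SURROGATE) {
                i++; // skip the code unit after the high surrogate as well
            } else {
                strb.append(c);
            }
        }
        return strb.toString();
    }

    public static void main(String[] args) {
        // Two U+1F389 characters written as escaped surrogate pairs.
        System.out.println(skipPairs("\uD83C\uDF89\uD83C\uDF89</fullText>")); // prints </fullText>
    }
}
```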
However, writing the loop this way is more robust and easier to read:
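for (int i = 0; i < chars.length; i++) {
    char c = chars[i];
    Character.UnicodeBlock block = Character.UnicodeBlock.of(c);
    // U+DB80..U+DBFF are also high surrogates, but they belong to a separate block
    if (block != Character.UnicodeBlock.HIGH_SURROGATES &&
        block != Character.UnicodeBlock.HIGH_PRIVATE_USE_SURROGATES &&
        block != Character.UnicodeBlock.LOW_SURROGATES) {
        strb.append(c);
    }
}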
This also handles malformed data: isolated high or low surrogates, or a high and a low surrogate in the wrong order. Such data should be skipped or reported as an error even if you decide to accept valid surrogate pairs as characters.
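The same idea can probably be written more compactly with `Character.isSurrogate(char)` (available since Java 7), which tests the whole range U+D800 to U+DFFF at once, so no individual Unicode block has to be listed. The following is only a sketch; `SurrogateStripper` and `stripSurrogates` are hypothetical names, not part of the question or of any library.

```java
final class SurrogateStripper {
    // Hypothetical helper: drops every surrogate code unit, paired or isolated.
    static String stripSurrogates(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (!Character.isSurrogate(c)) {
                out.append(c); // keep only characters from the Basic Multilingual Plane
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // A valid pair (U+1F389) followed by the closing tag from the question.
        System.out.println(stripSurrogates("\uD83C\uDF89</fullText>")); // prints </fullText>
        // Malformed input: an isolated low surrogate and an isolated high surrogate.
        System.out.println(stripSurrogates("a\uDF89b\uD83Cc")); // prints abc
    }
}
```

Note that this removes all characters outside the Basic Multilingual Plane, not just emoji, which matches what the question asks for but may be too aggressive for other feeds.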
Upvotes: 1