Java unable to parse a few Unicode characters received from a feed

Question

I receive the string below, which contains Unicode characters, in an XML document from one of my feed providers, and I am unable to parse it. I also tried taking the hex code of each character and prefixing it with \u, but that did not work either.

String str = "🎉🎉🎉🎉";
StringBuilder strb = new StringBuilder();
char[] chars = str.toCharArray();
for (int i = 0; i < chars.length; i++) {
  char c = chars[i];
  if ( c >= Character.MIN_HIGH_SURROGATE && c <= Character.MAX_HIGH_SURROGATE ) {
    char ch2 = chars[i+1];
  } else
    strb.append(c);
}
System.out.println(strb.toString());

Ideally this would have skipped those characters, but it does not. I want to remove those characters from the string.

Has anyone faced a similar issue? Any help in this regard is highly appreciated.

Vaibhav

Jukka K. Korpela · Accepted Answer

The code skips only the high surrogate of each pair; the low surrogate that follows is still appended. The simplest change that makes it skip the following low surrogate too is to replace the line

        char ch2 = chars[i+1];

to

        i++;

However, writing the loop this way is more robust and makes the code more readable:

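    for (int i = 0; i < chars.length; i++) {
        char c = chars[i];
        Character.UnicodeBlock block = Character.UnicodeBlock.of(c);
        // U+DB80..U+DBFF are also high surrogates but form their own block.
        if (block != Character.UnicodeBlock.HIGH_SURROGATES &&
            block != Character.UnicodeBlock.HIGH_PRIVATE_USE_SURROGATES &&
            block != Character.UnicodeBlock.LOW_SURROGATES) {
            strb.append(c);   // keep only non-surrogate chars
        }
    }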

This also handles malformed data: isolated high or low surrogates, or a high and a low surrogate in the wrong order. Such data should be skipped or treated as an error even if you decide to accept valid surrogate pairs as characters.
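
To make the difference on malformed input concrete, here is a minimal sketch that runs both variants side by side. The class and method names (SurrogateFilter, skipAfterHighSurrogate, dropAllSurrogates) are made up for illustration. Note also that the second method swaps in Character.isSurrogate (available since Java 7) for the block comparison; it tests the whole surrogate range U+D800 to U+DFFF in one call.

```java
final class SurrogateFilter {

    // Hypothetical helper: the loop from the question with the one-line fix (i++).
    static String skipAfterHighSurrogate(String input) {
        StringBuilder out = new StringBuilder();
        char[] chars = input.toCharArray();
        for (int i = 0; i < chars.length; i++) {
            char c = chars[i];
            if (c >= Character.MIN_HIGH_SURROGATE && c <= Character.MAX_HIGH_SURROGATE) {
                i++; // skip whatever follows, assuming it is the low surrogate
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    // Hypothetical helper: drop every surrogate char, paired or not.
    // Assumption: Character.isSurrogate stands in for the UnicodeBlock test.
    static String dropAllSurrogates(String input) {
        StringBuilder out = new StringBuilder();
        for (char c : input.toCharArray()) {
            if (!Character.isSurrogate(c)) {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String validPair = "a\uD83C\uDF89b"; // 'a', U+1F389 as a surrogate pair, 'b'
        String loneHigh  = "a\uD83Cb";       // malformed: high surrogate without a low one
        String loneLow   = "a\uDF89b";       // malformed: low surrogate without a high one

        // Both variants agree on well-formed input.
        System.out.println(skipAfterHighSurrogate(validPair)); // ab
        System.out.println(dropAllSurrogates(validPair));      // ab

        // On a lone high surrogate the one-line fix also drops the 'b'.
        System.out.println(skipAfterHighSurrogate(loneHigh));  // a
        System.out.println(dropAllSurrogates(loneHigh));       // ab

        // On a lone low surrogate the one-line fix removes nothing.
        System.out.println(skipAfterHighSurrogate(loneLow).length()); // 3
        System.out.println(dropAllSurrogates(loneLow));               // ab
    }
}
```

With the one-line fix, a lone high surrogate swallows the ordinary character after it, and a lone low surrogate passes through untouched. The second method removes both kinds without touching their neighbours.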
