vaibhav

Reputation: 4103

Java unable to parse a few Unicode characters received from a feed

I am receiving the string below, which contains Unicode characters, in an XML document from one of my feed providers, and I am unable to parse it. I also tried taking the hex code of each of these characters and prefixing it with \u, but that did not work either.

String str = "🎉🎉🎉🎉</fullText" + ">";
StringBuilder strb = new StringBuilder();
char[] chars = str.toCharArray();
for (int i = 0; i < chars.length; i++) {
  char c = chars[i];
  if ( c >= Character.MIN_HIGH_SURROGATE && c <= Character.MAX_HIGH_SURROGATE ) {
    char ch2 = chars[i+1];
  } else
    strb.append(c);
}
System.out.println(strb.toString());

Ideally this should have skipped those characters, but it does not. I want to remove those characters from the string.

Has anyone faced a similar issue? Any help in this regard is highly appreciated.

Vaibhav

Upvotes: 1

Views: 1970

Answers (1)

Jukka K. Korpela

Reputation: 201828

The code seems to skip only the high surrogate of each pair, so the low surrogate that follows is still appended. The simplest change that makes it skip the following low surrogate too is to change the line

        char ch2 = chars[i+1]; 

to

        i++;

However, it is more robust and makes the code more readable to write the loop this way:

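    for (int i = 0; i < chars.length; i++) {
        char c = chars[i];
        Character.UnicodeBlock block = Character.UnicodeBlock.of(c);
        // Surrogates span three blocks: U+D800-DB7F, U+DB80-DBFF and U+DC00-DFFF.
        if (block != Character.UnicodeBlock.HIGH_SURROGATES &&
            block != Character.UnicodeBlock.HIGH_PRIVATE_USE_SURROGATES &&
            block != Character.UnicodeBlock.LOW_SURROGATES) {
            strb.append(c);
        }
    }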

This also handles malformed data containing isolated high or low surrogates, or a high and a low surrogate in the wrong order. Such data should be skipped or handled as an error even if you treat valid surrogate pairs as acceptable characters.
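For reference, the same filter can be written as a small self-contained method. This is only a minimal sketch, not part of the original code: the class name SurrogateFilter and the method name stripSurrogates are made up for illustration, and it assumes Java 7 or later for Character.isSurrogate, which returns true for every surrogate code unit (U+D800 to U+DFFF), whether paired or isolated.

```java
// Hypothetical helper class; the names are illustrative only.
final class SurrogateFilter {
    private SurrogateFilter() {}

    // Returns a copy of input with every UTF-16 surrogate code unit removed.
    // Valid pairs (such as emoji), isolated surrogates and surrogates in the
    // wrong order are all dropped, because each char is tested on its own.
    static String stripSurrogates(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (!Character.isSurrogate(c)) {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

Calling SurrogateFilter.stripSurrogates on the string from the question should leave only the trailing </fullText> tag. Note that this removes all characters outside the Basic Multilingual Plane, not just emoji, so it is only appropriate if the feed data never needs them.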

Upvotes: 1
