Extracting Double Byte Characters/substring from a UTF-8 formatted String

Question

I'm trying to extract emojis and other special Characters from Strings for further processing (e.g. a String contains '😅' as one of its Characters).

But neither string.charAt(i) nor string.substring(i, i+1) works for me. The original String is formatted in UTF-8, which means that the escaped form of the above emoji is encoded as '\uD83D\uDE05'. As a result, I receive '?' (\uD83D) and '?' (\uDE05) for this position, so the emoji occupies two positions when iterating over the String.

Does anyone have a solution to this problem?

Paavo Pohndorff · Accepted Answer

Thanks to John Kugelman for the help. The solution now looks like this:

for (int codePoint : codePoints(string)) {
    char[] chars = Character.toChars(codePoint);
    System.out.println(codePoint + " : " + String.copyValueOf(chars));
}

The codePoints(String) helper method looks like this:

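// Requires: import java.util.Iterator;
private static Iterable<Integer> codePoints(final String string) {
    return new Iterable<Integer>() {
        public Iterator<Integer> iterator() {
            return new Iterator<Integer>() {
                // Index of the next UTF-16 code unit (char) to read
                int nextIndex = 0;

                public boolean hasNext() {
                    return nextIndex < string.length();
                }

                public Integer next() {
                    // Read a full code point; a surrogate pair counts as one
                    int result = string.codePointAt(nextIndex);
                    // Advance by 1 or 2 chars, depending on the code point
                    nextIndex += Character.charCount(result);
                    return result;
                }

                public void remove() {
                    throw new UnsupportedOperationException();
                }
            };
        }
    };
}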
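
For background: a Java String stores its text as UTF-16 code units, so a character outside the Basic Multilingual Plane such as '😅' (U+1F605) occupies two char values, the surrogate pair \uD83D\uDE05. On Java 8 or newer, the built-in String.codePoints() stream can probably replace the hand-written iterator above. The following is only a minimal sketch under that assumption; the class name CodePointDemo and the helper splitByCodePoint are made up for illustration and are not part of the original solution.

```java
import java.util.ArrayList;
import java.util.List;

class CodePointDemo {

    // Hypothetical helper: returns one String per Unicode code point,
    // so a surrogate pair such as "\uD83D\uDE05" stays together.
    static List<String> splitByCodePoint(String s) {
        List<String> parts = new ArrayList<>();
        // codePoints() (Java 8+) yields whole code points, not single chars
        s.codePoints().forEach(cp -> parts.add(new String(Character.toChars(cp))));
        return parts;
    }

    public static void main(String[] args) {
        String text = "a\uD83D\uDE05b"; // 'a', the emoji, 'b'
        for (String part : splitByCodePoint(text)) {
            System.out.println(part.codePointAt(0) + " : " + part);
        }
    }
}
```

Here text.length() is 4 because the emoji takes two chars, but the helper returns three elements, and the middle one contains the complete emoji with code point 128517 (0x1F605).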
