Reputation: 3189

Split UTF-16 String into single chars/strings

I have a string that looks like this: a👏b🙂c, and I want to split it into single chars/strings.

static List<String> split(String text ) {
    List<String> list = new ArrayList<>(text.length());
    for(int i = 0; i < text.length() ; i++) {
        list.add(text.substring(i, i + 1));
    }
    return list;
}

public static void main(String... args) {
    split("a\uD83D\uDC4Fb\uD83D\uDE42c")
            .forEach(System.out::println);
}

As you might have already noticed, instead of 👏 and 🙂 I'm getting two weird characters for each:

a
?
?
b
?
?
c

Upvotes: 4

Answers (3)

Michael Gantman

Reputation: 7792

There is an open-source MgntUtils library (written by me) with a utility that converts any string into a sequence of Unicode escapes and vice versa, handling code points correctly. It can help you with your problem and also help you understand what is going on behind the scenes. Here is an example:

The code below

String result = "a👏b🙂c";
result = StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence(result);
System.out.println(result);
result = StringUnicodeEncoderDecoder.decodeUnicodeSequenceToString(result);
System.out.println(result);

would produce the following:

\u0061\u1f44f\u0062\u1f642\u0063
a👏b🙂c

Here is the link to the article that explains the MgntUtils library and where to get it (including Javadoc and source code): Open Source Java library with stack trace filtering, Silent String parsing Unicode converter and Version comparison. Look for the paragraph "String Unicode converter".
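
For readers who only want to inspect which code points a string contains, a rough sketch using only the JDK might look like the following. This is not the MgntUtils implementation; the class name CodePointDump and the method toHexSequence are made up for illustration. It prints each code point as lowercase hex, padded to at least four digits.

```java
import java.util.stream.Collectors;

class CodePointDump {
    // Hypothetical helper: renders every code point as a backslash, 'u', and its hex value.
    static String toHexSequence(String text) {
        return text.codePoints()
                .mapToObj(cp -> String.format("\\u%04x", cp))
                .collect(Collectors.joining());
    }

    public static void main(String[] args) {
        // Same string as in the question, written with surrogate-pair escapes.
        System.out.println(toHexSequence("a\uD83D\uDC4Fb\uD83D\uDE42c"));
    }
}
```

Because the stream iterates over code points rather than chars, each emoji shows up as a single five-digit value instead of two surrogate values.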

Upvotes: 0

Karol Dowbecki

Reputation: 44952

As per the Character and String API docs, you need to work with code points to correctly handle characters that UTF-16 encodes as a surrogate pair (two char values).

"a👏b🙂c".codePoints().mapToObj(Character::toChars).forEach(System.out::println);

will output

a
👏
b
🙂
c
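
To see why the loop in the question breaks, it can help to compare the number of UTF-16 code units with the number of code points. A minimal sketch (the class name SurrogateDemo and its two helpers are only illustrative):

```java
class SurrogateDemo {
    // Number of UTF-16 code units, which is what String.length() reports.
    static int codeUnits(String text) {
        return text.length();
    }

    // Number of Unicode code points in the whole string.
    static int codePoints(String text) {
        return text.codePointCount(0, text.length());
    }

    public static void main(String[] args) {
        String text = "a\uD83D\uDC4Fb\uD83D\uDE42c";
        System.out.println(codeUnits(text));   // 7
        System.out.println(codePoints(text));  // 5
        // Each emoji occupies two chars: a high surrogate followed by a low surrogate.
        System.out.println(Character.isHighSurrogate(text.charAt(1))); // true
        System.out.println(Character.isLowSurrogate(text.charAt(2)));  // true
    }
}
```

Calling substring(i, i + 1) on index 1 or 2 therefore returns half of a surrogate pair, which is not a printable character on its own.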

Upvotes: 6

Tomasz Linkowski

Reputation: 4496

The following will do the job:

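import java.util.List;
import java.util.stream.Collectors;

// Map each code point to its char[] (one or two chars), then wrap it in a String.
static List<String> split(String text) {
    return text.codePoints()
            .mapToObj(Character::toChars)
            .map(String::valueOf)
            .collect(Collectors.toList());
}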
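
For comparison with the loop in the question, the same idea can also be written without streams. This is only a sketch, and the class name CodePointSplit is an assumption for the example; it advances the index by Character.charCount, so a surrogate pair is taken as one substring.

```java
import java.util.ArrayList;
import java.util.List;

class CodePointSplit {
    static List<String> split(String text) {
        List<String> list = new ArrayList<>(text.codePointCount(0, text.length()));
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            // 1 for characters in the Basic Multilingual Plane, 2 for surrogate pairs.
            int width = Character.charCount(codePoint);
            list.add(text.substring(i, i + width));
            i += width;
        }
        return list;
    }

    public static void main(String[] args) {
        split("a\uD83D\uDC4Fb\uD83D\uDE42c").forEach(System.out::println);
    }
}
```

Note that this still splits by code point, not by what a user perceives as one character: sequences such as an emoji followed by a skin-tone modifier would end up as separate list entries.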

Upvotes: 6
