Reputation: 32094

How to count grapheme clusters or "perceived" emoji characters in Java

I'm looking to count the number of perceived emoji characters in a provided Java string. I'm currently using the emoji4j library, but it doesn't work for grapheme clusters like this one: 👩‍👩‍👦‍👦

Calling EmojiUtil.getLength("👩‍👩‍👦‍👦") returns 4 instead of 1, and similarly calling EmojiUtil.getLength("👻👩‍👩‍👦‍👦") returns 5 instead of 2.

Are there any APIs or methods on String in Java that make it easy to count grapheme clusters?

I've been hunting around, but the codePoints() method on String understandably counts not only the visible emoji but also the zero-width joiners between them.
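To make the mismatch concrete, here is a minimal sketch of what the standard String methods report for the second sample string. The class and method names (CodePointDemo, codePointCount, and so on) are made up for illustration, and the string is written with Unicode escapes so that the source file encoding does not matter.

```java
import java.util.stream.Collectors;

final class CodePointDemo {
    // Ghost (U+1F47B) followed by the family emoji: woman, woman, boy, boy,
    // joined by three zero-width joiners (U+200D).
    static final String SAMPLE =
        "\uD83D\uDC7B"
        + "\uD83D\uDC69\u200D\uD83D\uDC69\u200D\uD83D\uDC66\u200D\uD83D\uDC66";

    static final int ZWJ = 0x200D;

    // Counts Unicode code points, joiners included.
    static long codePointCount(String s) {
        return s.codePoints().count();
    }

    // Counts code points but skips zero-width joiners.
    // This still counts each family member separately.
    static long codePointCountWithoutZwj(String s) {
        return s.codePoints().filter(cp -> cp != ZWJ).count();
    }

    // Space-separated hex dump of the code points.
    static String hex(String s) {
        return s.codePoints()
                .mapToObj(Integer::toHexString)
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        System.out.println(SAMPLE.length());                  // 13 UTF-16 code units
        System.out.println(codePointCount(SAMPLE));           // 8 code points
        System.out.println(codePointCountWithoutZwj(SAMPLE)); // 5, still not 2
        System.out.println(hex(SAMPLE));
    }
}
```

None of these three numbers is the perceived count of 2, which is why something that understands grapheme cluster boundaries is needed.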

I also attempted this using the BreakIterator:

public static int getLength(String emoji) {
    BreakIterator it = BreakIterator.getCharacterInstance();
    it.setText(emoji);
    int emojiCount = 0;
    while (it.next() != BreakIterator.DONE) {
        emojiCount++;
    }
    return emojiCount;
}

But it seems to behave identically to the codePoints() method, returning 8 for something like "👻👩‍👩‍👦‍👦".

Upvotes: 21

Answers (3)

Craig Otis

Reputation: 32094

ICU4J

I ended up using the ICU4J library, which worked much better. No changes (aside from import statements) were needed to my original code block, as ICU4J simply provides a different implementation of BreakIterator (com.ibm.icu.text.BreakIterator in place of java.text.BreakIterator).

Upvotes: 12

skomisa

Reputation: 17363

More than six years after this question was asked, an enhancement to properly process grapheme clusters within a String was finally implemented in Java 20, which was released in March 2023. See JDK-8291660, "Grapheme support in BreakIterator".

There is no change to the API of the BreakIterator class, but its underlying code now correctly treats a grapheme cluster as a single unit rather than multiple characters.

Here is a sample application, using the method and data provided in the question without any changes:

import java.nio.charset.Charset;
import java.text.BreakIterator;

public class Main {

    public static void main(String[] args) {
        System.out.println("System.getProperty(\"java.version\"): " + System.getProperty("java.version"));
        System.out.println("Charset.defaultCharset():" + Charset.defaultCharset());
        Main.printStringInfo("👩‍👩‍👦‍👦");
        Main.printStringInfo("👻👩‍👩‍👦‍👦");
    }

    static void printStringInfo(String s) {
        System.out.print("\nCode points for the String " + s + ":");
        s.codePoints().mapToObj(Integer::toHexString).forEach(x -> System.out.print(x + " "));
        System.out.println("\nThe length of the String " + s + " using String.length() is " + s.length());
        System.out.println("The length of the String " + s + " using BreakIterator is " + Main.getLength(s));
    }

    // Returns the correct number of perceived characters in a String.
    // Requires JDK 20+ to work correctly.
    // Earlier Java releases will incorrectly just count the code points instead.
    // See JDK-8291660 "Grapheme support in BreakIterator" (https://bugs.openjdk.org/browse/JDK-8291660).
    public static int getLength(String emoji) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(emoji);
        int count = 0;
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }
}

Here is the output, showing the correct grapheme counts (1 and 2) when using JDK 20:

C:\Java\jdk-20\bin\java.exe -javaagent:C:\Users\johndoe\AppData\Local\JetBrains\Toolbox\apps\IDEA-U\ch-0\232.5150.116\lib\idea_rt.jar=53642:C:\Users\johndoe\AppData\Local\JetBrains\Toolbox\apps\IDEA-U\ch-0\232.5150.116\bin -Dfile.encoding=UTF-8 -Dsun.stdout.encoding=UTF-8 -Dsun.stderr.encoding=UTF-8 -classpath D:\II2023.1\Graphemes\out\production\Graphemes Main
System.getProperty("java.version"): 20-ea
Charset.defaultCharset():UTF-8

Code points for the String 👩‍👩‍👦‍👦:1f469 200d 1f469 200d 1f466 200d 1f466 
The length of the String 👩‍👩‍👦‍👦 using String.length() is 11
The length of the String 👩‍👩‍👦‍👦 using BreakIterator is 1

Code points for the String 👻👩‍👩‍👦‍👦:1f47b 1f469 200d 1f469 200d 1f466 200d 1f466 
The length of the String 👻👩‍👩‍👦‍👦 using String.length() is 13
The length of the String 👻👩‍👩‍👦‍👦 using BreakIterator is 2

Process finished with exit code 0

And here is the output for the identical code showing incorrect grapheme counts (7 and 8) when using JDK 17:

C:\Java\jdk-17.0.2\bin\java.exe -javaagent:C:\Users\johndoe\AppData\Local\JetBrains\Toolbox\apps\IDEA-U\ch-0\232.5150.116\lib\idea_rt.jar=53775:C:\Users\johndoe\AppData\Local\JetBrains\Toolbox\apps\IDEA-U\ch-0\232.5150.116\bin -Dfile.encoding=UTF-8 -classpath D:\II2023.1\Graphemes\out\production\Graphemes Main
System.getProperty("java.version"): 17.0.2
Charset.defaultCharset():UTF-8

Code points for the String 👩‍👩‍👦‍👦:1f469 200d 1f469 200d 1f466 200d 1f466 
The length of the String 👩‍👩‍👦‍👦 using String.length() is 11
The length of the String 👩‍👩‍👦‍👦 using BreakIterator is 7

Code points for the String 👻👩‍👩‍👦‍👦:1f47b 1f469 200d 1f469 200d 1f466 200d 1f466 
The length of the String 👻👩‍👩‍👦‍👦 using String.length() is 13
The length of the String 👻👩‍👩‍👦‍👦 using BreakIterator is 8

Process finished with exit code 0

I tested this in IntelliJ IDEA 2023.1.1 Preview using Oracle OpenJDK version 20.0.1 and Oracle OpenJDK version 17.0.2.
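Since the same BreakIterator code gives different answers depending on the JDK, code that has to run on several Java versions may want to check the runtime first. The following is only a sketch under stated assumptions: the class and method names are hypothetical, the version check relies on Runtime.version().feature() (available since Java 10), and the fallback uses the regex \X matcher, which another answer here reports as handling extended grapheme clusters from JDK 15 onwards. On anything older than JDK 15 the fallback may still miscount emoji sequences.

```java
import java.text.BreakIterator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GraphemeCounter {
    // \X matches one grapheme cluster as understood by java.util.regex.
    private static final Pattern GRAPHEME = Pattern.compile("\\X");

    // JDK-8291660 changed BreakIterator's character instance in Java 20.
    static boolean breakIteratorHandlesGraphemes() {
        return Runtime.version().feature() >= 20;
    }

    // Picks BreakIterator on Java 20+, otherwise falls back to the regex.
    static int count(String s) {
        return breakIteratorHandlesGraphemes()
                ? countWithBreakIterator(s)
                : countWithRegex(s);
    }

    // Same loop as in the question: one increment per character boundary.
    static int countWithBreakIterator(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        int n = 0;
        while (it.next() != BreakIterator.DONE) {
            n++;
        }
        return n;
    }

    // Assumption: adequate on JDK 15 to 19; may miscount on older releases.
    static int countWithRegex(String s) {
        Matcher m = GRAPHEME.matcher(s);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }
}
```

Calling GraphemeCounter.count(text) then hides the version difference from the caller. If ICU4J is already on the classpath, using its BreakIterator everywhere is probably simpler than branching like this.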

Upvotes: 6

Michael Allan

Reputation: 3931

JDK 15 added support for extended grapheme clusters to the java.util.regex package. Here’s a solution based on that:

/** Returns the number of grapheme clusters within `text` between positions
  * `start` and `end`.  Omits any partial cluster at the end of the span.
  */
int columnarSpan( String text, int start, int end ) {
    return columnarSpan( text, start, end, /*wholeOnly*/true ); }


/** @param wholeOnly Whether to omit any partial cluster at the end
  *   of the span.  Iff `true` and `end` bisects the final cluster,
  *   then the final cluster is omitted from the count.
  */
int columnarSpan( final String text, final int start, final int end,
      final boolean wholeOnly ) {
    graphemeMatcher.reset( text ).region( start, end );
    int count = 0;
    while( graphemeMatcher.find() ) ++count;
    if( wholeOnly  &&  count > 0  &&  end < text.length() ) {
        final int countNext = columnarSpan( text, start, end + 1, false );
        if( countNext == count ) --count; } /* The character at `end` bisects
          the final cluster, which therefore lies partly outside the span.
          Therefore exclude it from the count. */
    return count; }


final Matcher graphemeMatcher = graphemePattern.matcher( "" );


/** The pattern of a grapheme cluster.
  */
public static final Pattern graphemePattern = Pattern.compile( "\\X" ); } /*
  An alternative means of cluster discovery is `java.text.BreakIterator`.
  Long outdated in this regard,  [https://bugs.openjdk.org/browse/JDK-8174266]
  it was updated for JDK 20.  [https://stackoverflow.com/a/76109241/2402790] */

Call it like this:

String emoji = "👻👩‍👩‍👦‍👦";
int count = columnarSpan( emoji, 0, /*end*/emoji.length() );
System.out.println( count );

⇒ 2

Note that it counts whole clusters only. If the given end bisects the final cluster (that is, the character at position end belongs to the same extended cluster as the preceding character), then the final cluster is omitted from the count. For example:

int count = columnarSpan( emoji, 0, /*end*/emoji.length() - 1 );
System.out.println( count );

⇒ 1

This is generally the behaviour you want in order to print a line of text with a character pointer positioned beneath it (e.g. ‘^’) pointing into the cluster of the character at the given index. To disable this behaviour, so that a bisected final cluster is counted and the pointer lands after the cluster, call the four-argument method with wholeOnly set to false:

int count = columnarSpan( emoji, 0, /*end*/emoji.length() - 1, false );
System.out.println( count );

⇒ 2
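The pointer use case mentioned above can be sketched as follows. This is not the columnarSpan method itself but a simplified stand-in with hypothetical names (CaretPointer, columnOf, pointerLine): it scans the whole text with \X and counts only the clusters that end at or before the given index, so an index inside a cluster yields the column of that cluster. It assumes one display column per cluster, which may not hold in terminals that render emoji double-width, and it uses String.repeat, which needs Java 11 or later.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CaretPointer {
    private static final Pattern GRAPHEME = Pattern.compile("\\X");

    // Number of whole grapheme clusters that end at or before charIndex.
    // An index that falls inside a cluster maps to that cluster's column.
    static int columnOf(String text, int charIndex) {
        Matcher m = GRAPHEME.matcher(text);
        int column = 0;
        while (m.find() && m.end() <= charIndex) {
            column++;
        }
        return column;
    }

    // A line to print beneath the text: padding spaces followed by '^'.
    static String pointerLine(String text, int charIndex) {
        return " ".repeat(columnOf(text, charIndex)) + "^";
    }

    public static void main(String[] args) {
        String text = "a\uD83D\uDC7Bb"; // 'a', ghost emoji, 'b'
        System.out.println(text);
        System.out.println(pointerLine(text, 3)); // points at 'b'
    }
}
```

Scanning from the start of the text each time keeps the sketch short; for long lines and many lookups, caching the cluster boundaries would likely be worthwhile.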

(Updated as per Skomisa’s comment.)

Upvotes: 4
