Reputation: 32094
I'm looking to count the number of perceived emoji characters in a given Java string. I'm currently using the emoji4j library, but it doesn't work for grapheme clusters like this one: 👩‍👩‍👦‍👦
Calling EmojiUtil.getLength("👩‍👩‍👦‍👦") returns 4 instead of 1, and similarly calling EmojiUtil.getLength("👻👩‍👩‍👦‍👦") returns 5 instead of 2.
Are there any APIs or methods on String in Java that make it easy to count grapheme clusters?
I've been hunting around, but understandably the codePoints() method on a String includes not only the visible emojis, but also the zero-width joiners.
I also attempted this using BreakIterator:
public static int getLength(String emoji) {
    BreakIterator it = BreakIterator.getCharacterInstance();
    it.setText(emoji);
    int emojiCount = 0;
    while (it.next() != BreakIterator.DONE) {
        emojiCount++;
    }
    return emojiCount;
}
But it seems to behave identically to the codePoints() method, returning 8 for something like "👻👩‍👩‍👦‍👦".
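For reference, the counts mentioned above can be reproduced with a minimal sketch like the one below. The class and method names (CodePointDemo, codePointCount, withoutJoiners) are made up for illustration, and the strings are built from Unicode escapes so that the source file's encoding cannot corrupt them. Filtering out the zero-width joiner (U+200D) gives 4 and 5, which happens to match what emoji4j reports, so neither number is the grapheme cluster count.

```java
final class CodePointDemo {
    // WOMAN, ZWJ, WOMAN, ZWJ, BOY, ZWJ, BOY (the family emoji from the question)
    static final String FAMILY =
            "\uD83D\uDC69\u200D\uD83D\uDC69\u200D\uD83D\uDC66\u200D\uD83D\uDC66";
    // GHOST (U+1F47B)
    static final String GHOST = "\uD83D\uDC7B";

    // Counts Unicode code points, joiners included.
    static long codePointCount(String s) {
        return s.codePoints().count();
    }

    // Counts code points other than the zero-width joiner (U+200D).
    static long withoutJoiners(String s) {
        return s.codePoints().filter(cp -> cp != 0x200D).count();
    }

    public static void main(String[] args) {
        String both = GHOST + FAMILY;
        System.out.println(FAMILY.length());        // 11 UTF-16 code units
        System.out.println(codePointCount(FAMILY)); // 7 code points
        System.out.println(withoutJoiners(FAMILY)); // 4 visible emoji code points
        System.out.println(both.length());          // 13
        System.out.println(codePointCount(both));   // 8
        System.out.println(withoutJoiners(both));   // 5
    }
}
```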
Upvotes: 21
Views: 5298
Reputation: 32094
I ended up using the ICU library (ICU4J), which worked much better. No changes (aside from import statements) were needed to my original code block, as it simply provides a different implementation of BreakIterator (com.ibm.icu.text.BreakIterator instead of java.text.BreakIterator).
Upvotes: 12
Reputation: 17363
More than six years after this question was asked, an enhancement to properly process grapheme clusters within a String was finally implemented in Java 20, which was released in March 2023. See JDK-8291660 Grapheme support in BreakIterator.
There is no change to the API of the BreakIterator class, but its underlying code now correctly treats a grapheme cluster as a single unit rather than multiple characters.
Here is a sample application, using the method and data provided in the question without any changes:
import java.nio.charset.Charset;
import java.text.BreakIterator;

public class Main {

    public static void main(String[] args) throws java.io.UnsupportedEncodingException {
        System.out.println("System.getProperty(\"java.version\"): " + System.getProperty("java.version"));
        System.out.println("Charset.defaultCharset():" + Charset.defaultCharset());
        Main.printStringInfo("👩‍👩‍👦‍👦");
        Main.printStringInfo("👻👩‍👩‍👦‍👦");
    }

    static void printStringInfo(String s) {
        System.out.print("\nCode points for the String " + s + ":");
        s.codePoints().mapToObj(Integer::toHexString).forEach(x -> System.out.print(x + " "));
        System.out.println("\nThe length of the String " + s + " using String.length() is " + s.length());
        System.out.println("The length of the String " + s + " using BreakIterator is " + Main.getLength(s));
    }

    // Returns the correct number of perceived characters in a String.
    // Requires JDK 20+ to work correctly.
    // Earlier Java releases will incorrectly just count the code points instead.
    // JDK-8291660 "Grapheme support in BreakIterator" (https://bugs.openjdk.org/browse/JDK-8291660) refers.
    public static int getLength(String emoji) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(emoji);
        int count = 0;
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }
}
Here is the output, showing the correct grapheme counts (1 and 2) when using JDK 20:
C:\Java\jdk-20\bin\java.exe -javaagent:C:\Users\johndoe\AppData\Local\JetBrains\Toolbox\apps\IDEA-U\ch-0\232.5150.116\lib\idea_rt.jar=53642:C:\Users\johndoe\AppData\Local\JetBrains\Toolbox\apps\IDEA-U\ch-0\232.5150.116\bin -Dfile.encoding=UTF-8 -Dsun.stdout.encoding=UTF-8 -Dsun.stderr.encoding=UTF-8 -classpath D:\II2023.1\Graphemes\out\production\Graphemes Main
System.getProperty("java.version"): 20-ea
Charset.defaultCharset():UTF-8
Code points for the String 👩‍👩‍👦‍👦:1f469 200d 1f469 200d 1f466 200d 1f466
The length of the String 👩‍👩‍👦‍👦 using String.length() is 11
The length of the String 👩‍👩‍👦‍👦 using BreakIterator is 1
Code points for the String 👻👩‍👩‍👦‍👦:1f47b 1f469 200d 1f469 200d 1f466 200d 1f466
The length of the String 👻👩‍👩‍👦‍👦 using String.length() is 13
The length of the String 👻👩‍👩‍👦‍👦 using BreakIterator is 2
Process finished with exit code 0
And here is the output for the identical code showing incorrect grapheme counts (7 and 8) when using JDK 17:
C:\Java\jdk-17.0.2\bin\java.exe -javaagent:C:\Users\johndoe\AppData\Local\JetBrains\Toolbox\apps\IDEA-U\ch-0\232.5150.116\lib\idea_rt.jar=53775:C:\Users\johndoe\AppData\Local\JetBrains\Toolbox\apps\IDEA-U\ch-0\232.5150.116\bin -Dfile.encoding=UTF-8 -classpath D:\II2023.1\Graphemes\out\production\Graphemes Main
System.getProperty("java.version"): 17.0.2
Charset.defaultCharset():UTF-8
Code points for the String 👩‍👩‍👦‍👦:1f469 200d 1f469 200d 1f466 200d 1f466
The length of the String 👩‍👩‍👦‍👦 using String.length() is 11
The length of the String 👩‍👩‍👦‍👦 using BreakIterator is 7
Code points for the String 👻👩‍👩‍👦‍👦:1f47b 1f469 200d 1f469 200d 1f466 200d 1f466
The length of the String 👻👩‍👩‍👦‍👦 using String.length() is 13
The length of the String 👻👩‍👩‍👦‍👦 using BreakIterator is 8
Process finished with exit code 0
I tested this in IntelliJ IDEA 2023.1.1 Preview using Oracle OpenJDK version 20.0.1 and Oracle OpenJDK version 17.0.2.
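If the same code has to run on both older and newer JDKs, one option is to check the runtime's feature version before trusting BreakIterator. The following is only a sketch under assumptions: the class and method names (GraphemeCounter, breakIteratorHandlesGraphemes, and so on) are hypothetical, the threshold of 20 comes from JDK-8291660 as described above, and the fallback uses the \X regex construct, which handles emoji clusters only on a JDK whose java.util.regex supports extended grapheme clusters.

```java
import java.text.BreakIterator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GraphemeCounter {
    // \X matches one grapheme cluster as defined by the running JDK's regex engine.
    private static final Pattern CLUSTER = Pattern.compile("\\X");

    // Assumption: BreakIterator is grapheme-aware from feature release 20 (JDK-8291660).
    static boolean breakIteratorHandlesGraphemes() {
        return Runtime.version().feature() >= 20;
    }

    // Counts character boundaries reported by java.text.BreakIterator.
    static int countWithBreakIterator(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        int count = 0;
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    // Counts successive \X matches over the whole string.
    static int countWithRegex(String s) {
        Matcher m = CLUSTER.matcher(s);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // Chooses BreakIterator on JDK 20+, otherwise falls back to the regex.
    static int count(String s) {
        return breakIteratorHandlesGraphemes()
                ? countWithBreakIterator(s)
                : countWithRegex(s);
    }

    public static void main(String[] args) {
        // WOMAN, ZWJ, WOMAN, ZWJ, BOY, ZWJ, BOY, written as escapes
        String family =
                "\uD83D\uDC69\u200D\uD83D\uDC69\u200D\uD83D\uDC66\u200D\uD83D\uDC66";
        System.out.println("JDK feature version: " + Runtime.version().feature());
        System.out.println("Perceived characters: " + count(family));
    }
}
```

Runtime.version().feature() requires Java 10 or later, so on anything older the check itself would need a different approach (for example, parsing the java.version system property).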
Upvotes: 6
Reputation: 3931
JDK 15 added support for extended grapheme clusters to the java.util.regex package. Here's a solution based on that:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// The enclosing class is not shown in the original; the name `Graphemes` is illustrative.
public class Graphemes {

    /** Returns the number of grapheme clusters within `text` between positions
      * `start` and `end`. Omits any partial cluster at the end of the span.
      */
    int columnarSpan( String text, int start, int end ) {
        return columnarSpan( text, start, end, /*wholeOnly*/true ); }

    /** @param wholeOnly Whether to omit any partial cluster at the end
      *   of the span. Iff `true` and `end` bisects the final cluster,
      *   then the final cluster is omitted from the count.
      */
    int columnarSpan( final String text, final int start, final int end,
          final boolean wholeOnly ) {
        graphemeMatcher.reset( text ).region( start, end );
        int count = 0;
        while( graphemeMatcher.find() ) ++count;
        if( wholeOnly && count > 0 && end < text.length() ) {
            final int countNext = columnarSpan( text, start, end + 1, false );
            if( countNext == count ) --count; } /* The character at `end` bisects
              the final cluster, which therefore lies partly outside the span.
              Therefore exclude it from the count. */
        return count; }

    /** A reusable matcher; being shared mutable state, it is not thread-safe.
      */
    final Matcher graphemeMatcher = graphemePattern.matcher( "" );

    /** The pattern of a grapheme cluster.
      */
    public static final Pattern graphemePattern = Pattern.compile( "\\X" ); } /*
        An alternative means of cluster discovery is `java.text.BreakIterator`.
        Long outdated in this regard, [https://bugs.openjdk.org/browse/JDK-8174266]
        it was updated for JDK 20. [https://stackoverflow.com/a/76109241/2402790] */
Call it like this:
String emoji = "👻👩‍👩‍👦‍👦";
int count = columnarSpan( emoji, 0, /*end*/emoji.length() );
System.out.println( count );
→ 2
Note that it counts whole clusters only. If the given end bisects the final cluster (the character at position end being part of the same extended cluster as the preceding character), then the final cluster is omitted from the count. For example:
int count = columnarSpan( emoji, 0, /*end*/emoji.length() - 1 );
System.out.println( count );
→ 1
This is generally the behaviour you want in order to print a line of text with a character pointer positioned beneath it (e.g. '^') pointing into the cluster of the character at the given index. To defeat this behaviour (pointing after the cluster), call the base method as follows.
int count = columnarSpan( emoji, 0, /*end*/emoji.length() - 1, false );
System.out.println( count );
→ 2
(Updated as per Skomisa's comment.)
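To illustrate the character-pointer use case mentioned above, here is a rough sketch of how a '^' line might be built. It is not the columnarSpan method itself but a simplified variant: the names CaretPointer, columnOf and pointerLine are hypothetical, it scans the whole string rather than a region, and it assumes every cluster occupies exactly one display column, which often does not hold for wide emoji in a terminal.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CaretPointer {
    // \X matches one grapheme cluster.
    private static final Pattern CLUSTER = Pattern.compile("\\X");

    /** Returns the number of whole clusters that end at or before `index`,
      * i.e. the zero-based column of the cluster containing `index`.
      */
    static int columnOf(String text, int index) {
        Matcher m = CLUSTER.matcher(text);
        int column = 0;
        while (m.find() && m.end() <= index) {
            column++;
        }
        return column;
    }

    /** Builds the pointer line: one space per preceding cluster, then '^'. */
    static String pointerLine(String text, int index) {
        StringBuilder sb = new StringBuilder();
        for (int i = columnOf(text, index); i > 0; i--) {
            sb.append(' ');
        }
        return sb.append('^').toString();
    }

    public static void main(String[] args) {
        // 'e' followed by COMBINING ACUTE ACCENT forms one cluster, then 'x'.
        String text = "e\u0301x";
        System.out.println(text);
        System.out.println(pointerLine(text, 2)); // caret under the 'x'
    }
}
```

An index that falls inside a cluster (index 1 in the example, the combining accent) yields the column of that cluster, so the caret points into it rather than after it.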
Upvotes: 4