rampion

Reputation: 89053

How can I iterate through the unicode codepoints of a Java String?

So I know about String#codePointAt(int), but it's indexed by the char offset, not by the codepoint offset.

I'm thinking about trying something like:

But my concerns are

Upvotes: 117

Views: 40285

Answers (4)

Alex - GlassEditor.com

Reputation: 15507

CharSequence#codePoints returns an IntStream

Java 8 added CharSequence#codePoints, which returns an IntStream of the string's code points.

You can use the stream directly to iterate over them:

string.codePoints().forEach(c -> …);

Or, use a for loop by collecting the stream into an array:

for(int codePoint : string.codePoints().toArray()){ … }

See example run at Ideone.com:

String input = "Café ☕" ;
for ( int codePoint : input.codePoints().toArray() )
{
    System.out.println
    ( 
        Character.toString( codePoint ) + 
        " = # " + codePoint + 
        " " + Character.getName( codePoint ) 
    );
}

Output:

C = # 67 LATIN CAPITAL LETTER C
a = # 97 LATIN SMALL LETTER A
f = # 102 LATIN SMALL LETTER F
é = # 233 LATIN SMALL LETTER E WITH ACUTE
  = # 32 SPACE
☕ = # 9749 HOT BEVERAGE

These approaches are probably more expensive than Jonathan Feinberg's solution, but they are quicker to read and write, and the performance difference will usually be insignificant.
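If the array copy is a concern, the stream can also be consumed directly. The following is only a minimal sketch of that idea: the class name CodePointStreamDemo and both helper methods are made up for illustration, and it assumes Java 8 or later.

```java
class CodePointStreamDemo {
    // Hypothetical helper: counts code points outside the BMP
    // (the ones stored as two chars in a String).
    static long countSupplementary(String input) {
        return input.codePoints()
                .filter(Character::isSupplementaryCodePoint)
                .count();
    }

    // Hypothetical helper: rebuilds the string from its BMP code points only,
    // without collecting the stream into an int[] first.
    static String keepBmpOnly(String input) {
        return input.codePoints()
                .filter(Character::isBmpCodePoint)
                .collect(StringBuilder::new,
                         StringBuilder::appendCodePoint,
                         StringBuilder::append)
                .toString();
    }

    public static void main(String[] args) {
        // 'a', U+1F600 written as a surrogate pair, 'b'
        String s = "a\uD83D\uDE00b";
        System.out.println(countSupplementary(s));
        System.out.println(keepBmpOnly(s));
    }
}
```

Both helpers see U+1F600 as a single int, even though it occupies two chars in the String.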

Upvotes: 88

Alexander Egger

Reputation: 5300

Iterating over code points is filed as a feature request at Sun.

See Bug Report

The report also includes an example of how to iterate over a String's code points.

Upvotes: 6

rogerdpack

Reputation: 66741

Thought I'd add a workaround method that works with foreach loops (ref). It is also easy to replace with Java 8's String#codePoints method once you move to Java 8.

You can use it with foreach like this:

 for(int codePoint : codePoints(myString)) {
   ....
 }

Here's the method:

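import java.util.Iterator;
import java.util.NoSuchElementException;

public static Iterable<Integer> codePoints(final String string) {
  return new Iterable<Integer>() {
    public Iterator<Integer> iterator() {
      return new Iterator<Integer>() {
        int nextIndex = 0; // char offset of the next code point
        public boolean hasNext() {
          return nextIndex < string.length();
        }
        public Integer next() {
          // honor the Iterator contract instead of failing with an index error
          if (!hasNext()) {
            throw new NoSuchElementException();
          }
          int result = string.codePointAt(nextIndex);
          // advance by 1 or 2 chars, depending on the code point
          nextIndex += Character.charCount(result);
          return result;
        }
        public void remove() {
          throw new UnsupportedOperationException();
        }
      };
    }
  };
}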
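For reference, here is a rough sketch of what the Java 8 conversion mentioned above might look like. The class name CodePointIterables is made up for illustration. It relies on the fact that the iterator returned by IntStream#iterator (a PrimitiveIterator.OfInt) is also an Iterator<Integer>.

```java
class CodePointIterables {
    // Java 8+ variant of the codePoints(String) helper: each call to
    // iterator() opens a fresh stream, so the Iterable can be reused.
    static Iterable<Integer> codePoints(final String string) {
        return () -> string.codePoints().iterator();
    }

    public static void main(String[] args) {
        // 'x' followed by U+1F600 written as a surrogate pair
        for (int codePoint : codePoints("x\uD83D\uDE00")) {
            System.out.println(Integer.toHexString(codePoint));
        }
    }
}
```

The calling code stays the same as with the hand-written iterator, so switching over should only require replacing the helper's body.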

Alternatively, if your code can work more easily with a materialized collection of code points, you can convert the string to a List<Integer> (this may use more RAM than the approach above):

import java.util.ArrayList;
import java.util.List;

public static List<Integer> stringToCodePoints(String in) {
  if (in == null)
    throw new NullPointerException("got null");
  List<Integer> out = new ArrayList<Integer>();
  final int length = in.length();
  for (int offset = 0; offset < length; ) {
    final int codepoint = in.codePointAt(offset);
    out.add(codepoint);
    // step over 1 or 2 chars, depending on the code point
    offset += Character.charCount(codepoint);
  }
  return out;
}

Thankfully, both versions use codePointAt, which safely handles the surrogate pairs of UTF-16 (Java's internal string representation).

Upvotes: 10

Jonathan Feinberg

Reputation: 45324

Yes, Java uses a UTF-16-esque encoding for the internal representation of Strings, and, yes, it encodes characters outside the Basic Multilingual Plane (BMP) as surrogate pairs.

If you know you'll be dealing with characters outside the BMP, then here is the canonical way to iterate over the code points of a Java String:

final int length = s.length();
for (int offset = 0; offset < length; ) {
   final int codepoint = s.codePointAt(offset);

   // do something with the codepoint

   offset += Character.charCount(codepoint);
}
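To see the difference between char offsets and code points, here is a small self-contained sketch built around the same loop. The class name CodePointLoopDemo and the helper codePointsOf are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class CodePointLoopDemo {
    // Hypothetical helper: collects the code points of s with the
    // offset-based loop shown above.
    static List<Integer> codePointsOf(String s) {
        List<Integer> result = new ArrayList<>();
        final int length = s.length();
        for (int offset = 0; offset < length; ) {
            final int codepoint = s.codePointAt(offset);
            result.add(codepoint);
            // 1 for BMP code points, 2 for surrogate pairs
            offset += Character.charCount(codepoint);
        }
        return result;
    }

    public static void main(String[] args) {
        // 'a', U+1F600 written as a surrogate pair, 'b'
        String s = "a\uD83D\uDE00b";
        System.out.println(s.length());      // 4 (chars, not code points)
        System.out.println(codePointsOf(s)); // [97, 128512, 98]
    }
}
```

The string holds four chars but only three code points, which is why the loop advances by Character.charCount(codepoint) rather than by one.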

Upvotes: 158
