peter.murray.rust

Reputation: 38033

Creating and using Strings with surrogate pairs

I have to work with code points above U+FFFF (specifically mathematical script characters) and have not found a simple tutorial on how to do this. I want to be able to (a) create Strings containing such code points and (b) iterate over the code points in them. Since a char cannot hold these values, my code looks like:

@Test
public void testSurrogates() {
    // Creating a string: appendCodePoint() writes a surrogate pair for code points above U+FFFF.
    StringBuilder sb = new StringBuilder();
    sb.append("a");
    sb.appendCodePoint(120030); // U+1D4DE, outside the BMP
    sb.append("b");
    String s = sb.toString();
    System.out.println("s> " + s + " " + s.length()); // length() counts chars, so 4 here
    // Iterating over the string by code point rather than by char.
    int codePointCount = s.codePointCount(0, s.length());
    Assert.assertEquals(3, codePointCount);
    int charIndex = 0;
    for (int i = 0; i < codePointCount; i++) {
        int codepoint = s.codePointAt(charIndex);
        int charCount = Character.charCount(codepoint); // 1 or 2 chars
        System.out.println(codepoint + " " + charCount);
        charIndex += charCount;
    }
}

I'm not confident that this is either fully correct or the cleanest way to do it. I would have expected a method such as codePointAfter(), but there is only codePointBefore(). Please confirm that this is the right strategy or suggest an alternative.
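On the missing codePointAfter(): one possible way to step forward is String.offsetByCodePoints(index, 1), which returns the char index one code point further along. The following is only a sketch of that idea. The class name CodePointStepper and the helper codePointsOf are made up for illustration and are not part of the JDK.

```java
import java.util.ArrayList;
import java.util.List;

final class CodePointStepper {
    /** Collects the code points of s, advancing the char index with offsetByCodePoints. */
    static List<Integer> codePointsOf(String s) {
        List<Integer> result = new ArrayList<>();
        // offsetByCodePoints(i, 1) moves i past one code point (one or two chars).
        for (int i = 0; i < s.length(); i = s.offsetByCodePoints(i, 1)) {
            result.add(s.codePointAt(i));
        }
        return result;
    }

    public static void main(String[] args) {
        String s = new StringBuilder()
                .append('a')
                .appendCodePoint(0x1D4DE) // 120030 decimal, as in the test above
                .append('b')
                .toString();
        System.out.println(s.length());      // prints 4 (chars, not code points)
        System.out.println(codePointsOf(s)); // prints [97, 120030, 98]
    }
}
```

This avoids tracking Character.charCount() by hand, at the cost of a second library call per step.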

UPDATE: Thanks for the confirmation @Jon. I struggled with this - here are two mistakes to avoid:

Upvotes: 4

Views: 681

Answers (1)

Jon Skeet

Reputation: 1499730

It looks correct to me. If you want to iterate over the code points in a string, you could wrap this code in an Iterable:

import java.util.Iterator;
import java.util.NoSuchElementException;

public static Iterable<Integer> getCodePoints(final String text) {
    return new Iterable<Integer>() {
        @Override public Iterator<Integer> iterator() {
            return new Iterator<Integer>() {
                private int nextIndex = 0;

                @Override public boolean hasNext() {
                    return nextIndex < text.length();
                }

                @Override public Integer next() {
                    if (!hasNext()) {
                        throw new NoSuchElementException();
                    }
                    int codePoint = text.codePointAt(nextIndex);
                    nextIndex += Character.charCount(codePoint);
                    return codePoint;
                }

                @Override public void remove() {
                    throw new UnsupportedOperationException();
                }
            };
        }
    };
}

Or, of course, you could change the method to return an int[] instead:

public static int[] getCodePoints(String text) {
    int[] ret = new int[text.codePointCount(0, text.length())];
    int charIndex = 0;
    for (int i = 0; i < ret.length; i++) {
        ret[i] = text.codePointAt(charIndex);
        charIndex += Character.charCount(ret[i]);
    }
    return ret;
}
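Going the other way, from an int[] of code points back to a String, can be sketched with the String(int[] codePoints, int offset, int count) constructor. The class name CodePointRoundTrip and the helper fromCodePoints are hypothetical names chosen for this example.

```java
final class CodePointRoundTrip {
    /** Builds a String from code points; values above U+FFFF become surrogate pairs. */
    static String fromCodePoints(int[] codePoints) {
        return new String(codePoints, 0, codePoints.length);
    }

    public static void main(String[] args) {
        int[] codePoints = {'a', 0x1D4DE, 'b'}; // 0x1D4DE is 120030 decimal
        String s = fromCodePoints(codePoints);
        System.out.println(s.length());                       // prints 4
        System.out.println(s.codePointCount(0, s.length()));  // prints 3
    }
}
```

For building a string incrementally, appendCodePoint() as used in the question does the same job one code point at a time.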

I agree that it's a pity that the Java libraries don't expose methods like this already, but at least they're not too hard to write.
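Later Java versions narrow this gap: since Java 8, CharSequence (and therefore String) has a codePoints() method that returns an IntStream of code points. A minimal sketch, assuming Java 8 or later; the class name CodePointStreams and the helper toCodePoints are illustrative only.

```java
import java.util.Arrays;

final class CodePointStreams {
    /** Returns the code points of text as an int[], using the Java 8+ stream API. */
    static int[] toCodePoints(String text) {
        return text.codePoints().toArray();
    }

    public static void main(String[] args) {
        String s = "a\uD835\uDCDEb"; // 'a', U+1D4DE as a surrogate pair, 'b'
        System.out.println(Arrays.toString(toCodePoints(s))); // prints [97, 120030, 98]
        // Iterate directly, printing each code point and how many chars it occupies.
        s.codePoints().forEach(cp -> System.out.println(cp + " " + Character.charCount(cp)));
    }
}
```

On older runtimes the hand-written helpers above remain the way to go.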

Upvotes: 5
