Logan Murphy

Reputation: 6230

How do I iterate through a String, so that combining characters stay with their base characters?

I am attempting to iterate the following string:

mɔ̃tr

But no matter what I do, it always ends up getting processed as:

m ɔ ̃ t r

The tilde seems to detach from the reversed c.

One of my first attempts was to do the following:

"mɔ̃tr".map {
    print(it)
}

The tilde would not stay with the reversed c.

I saw suggestions for the following iterator:

fun codePoints(string: String): Iterable<String> {
    return object : Iterable<String> {
        override fun iterator(): MutableIterator<String> {
            return object : MutableIterator<String> {
                var nextIndex = 0
                override fun hasNext(): Boolean {
                    return nextIndex < string.length
                }

                override fun next(): String {
                    val result = string.codePointAt(nextIndex)
                    nextIndex += Character.charCount(result)
                    return String(Character.toChars(result))
                }

                override fun remove() {
                    throw UnsupportedOperationException()
                }
            }
        }
    }
}

But this gave the same output as the previous example.

I have been stuck on this seemingly simple problem for a day now. I just want to process this string as though it had 4 characters, not 5.

Any tips?

Upvotes: 3

Views: 1039

Answers (1)

Sweeper

Reputation: 273540

"ɔ̃" consists of two Unicode code points: the base letter ɔ (U+0254) followed by a combining tilde (U+0303). This is why the code point iterator you showed still returns them as two separate elements.
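To see the two code points directly, here is a minimal sketch. The helper name codePointsOf is made up for this illustration, and the string is written with \u escapes so the combining mark cannot get lost when copying:

```kotlin
// Hypothetical helper (not from the question): collects the code points of a string.
fun codePointsOf(text: String): List<Int> {
    val result = mutableListOf<Int>()
    var index = 0
    while (index < text.length) {
        val codePoint = text.codePointAt(index)
        result += codePoint
        // Advance by 1 or 2 chars, depending on whether this is a surrogate pair.
        index += Character.charCount(codePoint)
    }
    return result
}

fun main() {
    // U+0254 (open o) followed by U+0303 (combining tilde)
    val nasalO = "\u0254\u0303"
    println(codePointsOf(nasalO).map { "U+%04X".format(it) }) // [U+0254, U+0303]
}
```

Iterating by code point only helps with surrogate pairs. It does nothing to keep a combining mark attached to its base letter.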

"ɔ̃" is, however, a single grapheme cluster. To iterate over grapheme clusters, you need a java.text.BreakIterator obtained from BreakIterator.getCharacterInstance(). Its documentation includes an example that walks the boundaries:

public static void printEachForward(BreakIterator boundary, String source) {
    int start = boundary.first();
    for (int end = boundary.next();
         end != BreakIterator.DONE;
         start = end, end = boundary.next()) {
         System.out.println(source.substring(start,end));
    }
}

In Kotlin, you can write an extension function on String that returns a Sequence of the grapheme clusters:

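import java.text.BreakIterator

fun String.graphemeClusterSequence(): Sequence<String> = sequence {
    // A character-instance BreakIterator reports grapheme cluster boundaries.
    val iterator = BreakIterator.getCharacterInstance()
    iterator.setText(this@graphemeClusterSequence)
    var start = iterator.first()
    var end = iterator.next()
    while (end != BreakIterator.DONE) {
        // Each [start, end) range is one cluster: a base character plus its combining marks.
        yield(this@graphemeClusterSequence.substring(start, end))
        start = end
        end = iterator.next()
    }
}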

Now "mɔ̃tr".graphemeClusterSequence().forEach { println(it) } prints:

m
ɔ̃
t
r
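To treat the string as having 4 characters rather than 5, you can collect the clusters into a list and work with that. The following is a self-contained sketch of the same BreakIterator loop. The function name graphemeClusters is made up for this example, and the word is written with \u escapes:

```kotlin
import java.text.BreakIterator

// Hypothetical helper: the same boundary walk as above, collected eagerly into a list.
fun graphemeClusters(text: String): List<String> {
    val boundary = BreakIterator.getCharacterInstance()
    boundary.setText(text)
    val clusters = mutableListOf<String>()
    var start = boundary.first()
    var end = boundary.next()
    while (end != BreakIterator.DONE) {
        clusters += text.substring(start, end)
        start = end
        end = boundary.next()
    }
    return clusters
}

fun main() {
    val word = "m\u0254\u0303tr"            // m, open o + combining tilde, t, r
    val clusters = graphemeClusters(word)
    println(word.length)                    // 5 (UTF-16 chars)
    println(clusters.size)                  // 4 (grapheme clusters)
    // Reversing by cluster keeps the tilde on its base letter.
    println(clusters.reversed().joinToString(""))
}
```

Which sequences count as one cluster depends on the Unicode data shipped with your JDK, so more complex cases such as emoji sequences may split differently between Java versions.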

Upvotes: 12
