Given a list of Unicode code points, how does one split them into a list of Unicode characters?

Question

I'm writing a lexical analyzer for Unicode text. Many Unicode characters require multiple code points (even after canonical composition). For example, tuple(map(ord, unicodedata.normalize('NFC', 'ā́'))) evaluates to (257, 769). How can I know where the boundary is between two characters? Additionally, I'd like to store the unnormalized version of the text. My input is guaranteed to be Unicode.

So far, this is what I have:

from unicodedata import normalize

def split_into_characters(text):
    character = ""
    characters = []

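    for code_point in text:
        # Tentatively extend the current character with the next code point.
        character += code_point

        # If the buffer no longer normalizes to a single code point, emit
        # everything before the new code point and start over from it.
        if len(normalize('NFKC', character)) > 1:
            characters.append(character[:-1])
            character = character[-1]

    # Flush whatever is left in the buffer.
    if character:
        characters.append(character)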

    return characters

print(split_into_characters('Puélla in vī́llā vīcī́nā hábitat.'))

This incorrectly prints the following:

['P', 'u', 'é', 'l', 'l', 'a', ' ', 'i', 'n', ' ', 'v', 'ī', '́', 'l', 'l', 'ā', ' ', 'v', 'ī', 'c', 'ī', '́', 'n', 'ā', ' ', 'h', 'á', 'b', 'i', 't', 'a', 't', '.']

I expect it to print the following:

['P', 'u', 'é', 'l', 'l', 'a', ' ', 'i', 'n', ' ', 'v', 'ī́', 'l', 'l', 'ā', ' ', 'v', 'ī', 'c', 'ī́', 'n', 'ā', ' ', 'h', 'á', 'b', 'i', 't', 'a', 't', '.']

一二三 · Accepted Answer

The boundaries between user-perceived characters (grapheme clusters) can be identified with Unicode's Grapheme Cluster Boundary algorithm, defined in Unicode Standard Annex #29 (Unicode Text Segmentation). Python's unicodedata module doesn't expose the data the algorithm needs (the Grapheme_Cluster_Break property), but complete implementations can be found in libraries like PyICU and uniseg.
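The code in the question fails because some base-plus-mark sequences have no precomposed form: 'ī' followed by U+0301 still normalizes to two code points, so the length check splits the accent off. As a rough illustration of the idea behind the boundary rules, here is a minimal standard-library sketch that attaches each combining mark to the preceding code point. It is an approximation, not the full algorithm: it ignores Hangul jamo, emoji ZWJ sequences, regional indicator pairs, CR LF, and the other cases the libraries above handle. The function name split_into_clusters_approx is made up for this example.

```python
import unicodedata

def split_into_clusters_approx(text):
    """Approximate grapheme clusters using only the general category.

    Assumption: a cluster is one code point followed by any run of
    combining marks (general categories Mn, Mc, Me). This is a
    simplification of the real boundary rules.
    """
    clusters = []
    for code_point in text:
        is_mark = unicodedata.category(code_point).startswith("M")
        if clusters and is_mark:
            # Attach the mark to the cluster that precedes it.
            clusters[-1] += code_point
        else:
            # Anything else (or a leading mark) starts a new cluster.
            clusters.append(code_point)
    return clusters

# 'i' + COMBINING MACRON + COMBINING ACUTE ACCENT stays together.
print(split_into_clusters_approx("vi\u0304\u0301lla"))
```

Because the function never normalizes its input, the clusters it returns contain the original, unnormalized code points, and joining them reproduces the input exactly.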
