Reputation: 12672
I'm writing a lexical analyzer for Unicode text. Many Unicode characters require multiple code points (even after canonical composition). For example, tuple(map(ord, unicodedata.normalize('NFC', 'ā́'))) evaluates to (257, 769). How can I know where the boundary is between two characters? Additionally, I'd like to store the unnormalized version of the text. My input is guaranteed to be Unicode.
So far, this is what I have:
from unicodedata import normalize

def split_into_characters(text):
    character = ""
    characters = []
    for i in range(len(text)):
        character += text[i]
        if len(normalize('NFKC', character)) > 1:
            characters.append(character[:-1])
            character = character[-1]
    if len(character) > 0:
        characters.append(character)
    return characters

print(split_into_characters('Puélla in vī́llā vīcī́nā hábitat.'))
This incorrectly prints the following:
['P', 'u', 'é', 'l', 'l', 'a', ' ', 'i', 'n', ' ', 'v', 'ī', '́', 'l', 'l', 'ā', ' ', 'v', 'ī', 'c', 'ī', '́', 'n', 'ā', ' ', 'h', 'á', 'b', 'i', 't', 'a', 't', '.']
I expect it to print the following:
['P', 'u', 'é', 'l', 'l', 'a', ' ', 'i', 'n', ' ', 'v', 'ī́', 'l', 'l', 'ā', ' ', 'v', 'ī', 'c', 'ī́', 'n', 'ā', ' ', 'h', 'á', 'b', 'i', 't', 'a', 't', '.']
Upvotes: 2
Views: 371
Reputation: 21
You may want to use the pyuegc library, an implementation of the Unicode algorithm for breaking code point sequences into extended grapheme clusters as specified in UAX #29.
from pyuegc import EGC # pip install pyuegc
string = 'Puélla in vī́llā vīcī́nā hábitat.'
egc = EGC(string)
print(egc)
# ['P', 'u', 'é', 'l', 'l', 'a', ' ', 'i', 'n', ' ', 'v', 'ī́', 'l', 'l', 'ā', ' ', 'v', 'ī', 'c', 'ī́', 'n', 'ā', ' ', 'h', 'á', 'b', 'i', 't', 'a', 't', '.']
print(len(string))
# 35
print(len(egc))
# 31
Upvotes: 1
Reputation: 21259
The boundaries between perceived characters can be identified with Unicode's Grapheme Cluster Boundary algorithm. Python's unicodedata module doesn't have the necessary data for the algorithm (the Grapheme_Cluster_Break property), but complete implementations can be found in libraries like PyICU and uniseg.
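If pulling in a library isn't an option, a rough approximation is possible with the standard library alone. The sketch below is not the Grapheme Cluster Boundary algorithm: it only attaches combining marks (general categories Mn, Mc, Me) to the preceding code point, so it will mis-split Hangul jamo sequences, emoji ZWJ sequences, regional-indicator flags, and CRLF. The function name split_into_clusters is made up for illustration. It never normalizes, so the original text is preserved.

```python
import unicodedata

def split_into_clusters(text):
    """Approximate grapheme clusters by gluing combining marks to their base.

    Simplification, NOT full UAX #29: only general categories Mn/Mc/Me
    are treated as cluster-extending.
    """
    clusters = []
    for ch in text:
        # Categories starting with 'M' (Mn, Mc, Me) are combining marks.
        if clusters and unicodedata.category(ch).startswith('M'):
            clusters[-1] += ch   # extend the current cluster
        else:
            clusters.append(ch)  # start a new cluster
    return clusters

# Escapes make the code points explicit: i + macron + acute, a + macron.
text = 'vi\u0304\u0301lla\u0304'
print(split_into_clusters(text))
```

Joining the returned list gives back the input unchanged, which covers the requirement to keep the unnormalized text. For anything beyond Latin-style combining marks, a full implementation such as those named above is the safer choice.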
Upvotes: 4