Reputation: 424
I want to split a word into a list of characters without breaking apart characters that carry combining marks, like ɑ̃.
For example:
word = "aplavɑ̃tʁɛ"
list(word)
['a', 'p', 'l', 'a', 'v', 'ɑ', '̃', 't', 'ʁ', 'ɛ']
I want to have:
['a', 'p', 'l', 'a', 'v', 'ɑ̃', 't', 'ʁ', 'ɛ']
Upvotes: 1
Views: 259
Reputation: 189689
You want to check if a character is a combining character, and if so, keep it together with the preceding character.
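import unicodedata
word = "aplavɑ̃tʁɛ"
characters = []
for char in word:
    # combining() is non-zero for combining marks such as U+0303 (combining tilde)
    if characters and unicodedata.combining(char):
        characters[-1] += char  # attach the mark to the preceding character
    else:
        # base character, or a stray mark at the very start of the string
        characters.append(char)
print(characters)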
Result:
['a', 'p', 'l', 'a', 'v', 'ɑ̃', 't', 'ʁ', 'ɛ']
Unicode lets you attach any number of combining marks (diacritics etc.) to a base character, so you can build up characters with multiple accents, which is frequently required in IPA and, for example, Vietnamese orthography. (This is sometimes abused in so-called "zalgo text".)
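To illustrate the point about multiple accents, here is a minimal sketch that wraps the same loop in a function and runs it on a base character carrying two stacked marks. The function name `split_with_marks` is just an illustrative choice, not a standard library API, and the sample string is made up for the demonstration. It only groups characters whose canonical combining class is non-zero, so it is not a full implementation of Unicode grapheme cluster segmentation.

```python
import unicodedata


def split_with_marks(text):
    """Split text into base characters, each keeping its trailing combining marks.

    Hypothetical helper for illustration; it relies only on
    unicodedata.combining(), which returns 0 for non-combining characters.
    """
    clusters = []
    for char in text:
        # Attach a combining mark to the previous cluster, if there is one.
        if clusters and unicodedata.combining(char):
            clusters[-1] += char
        else:
            clusters.append(char)
    return clusters


# "a" followed by a decomposed Vietnamese "ệ":
# e + COMBINING DOT BELOW (U+0323) + COMBINING CIRCUMFLEX ACCENT (U+0302)
stacked = "ae\u0323\u0302"
clusters = split_with_marks(stacked)
print(clusters)
print([len(c) for c in clusters])  # code points per cluster: [1, 3]
```

Both marks end up in the same list element as the "e", so the approach works no matter how many accents are stacked on one base character.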
Upvotes: 1