Index strings by letter including diacritics

Question

I'm not sure how to formulate this question, but I'm looking for a magic function that makes this code

for x in magicfunc("H̶e̕l̛l͠o͟ ̨w̡o̷r̀l҉ḑ!͜"):
    print(x)

behave like this:

H̶
e̕
l̛
l͠
o͟
 ̨
w̡
o̷
r̀
l҉
ḑ
!͜

Basically, is there a built-in Unicode function or method that takes a string and returns an array with one entry per glyph, each including its Unicode decorators, diacritical marks, and so on? It should work the same way a text editor moves the cursor to the next letter instead of stepping through every combining character.

If not, I'll write the function myself, no help needed. Just wondering if it already exists.

njzk2 · Accepted Answer

You can use unicodedata.combining to find out if a character is combining:

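import unicodedata
from typing import Iterable

def combine(s: str) -> Iterable[str]:
  # Start with an empty buffer so a leading combining character
  # does not fail on `None += x`.
  buf = ""
  for x in s:
    if unicodedata.combining(x) != 0:
      # combining character: attach it to the current base character
      buf += x
    else:
      # base character: emit the previous group and start a new one
      if buf:
        yield buf
      buf = x
  if buf:
    yield buf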

Result:

>>> for x in combine("H̶e̕l̛l͠o͟ ̨w̡o̷r̀l҉ḑ!͜"):
...     print(x)
... 
H̶
e̕
l̛
l͠
o͟
 ̨
w̡
o̷
r̀
l
҉
ḑ
!͜

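The issue is that COMBINING CYRILLIC MILLIONS SIGN (U+0489) is not recognized as combining, so it is emitted on its own line. unicodedata.combining returns the canonical combining class, which is 0 for enclosing marks such as this one (general category Me). You could also test whether "COMBINING" is in unicodedata.name(x) for the character, which should solve it for this string. Pass a default, as in unicodedata.name(x, ""), because name raises ValueError for characters that have no name.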
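As an alternative to the name check, here is a minimal sketch that groups on the general category instead, assuming that every character whose category starts with "M" (Mn, Mc, Me) should attach to the preceding base character. The function name clusters is made up for illustration and is not part of unicodedata.

```python
import unicodedata
from typing import Iterator


def clusters(s: str) -> Iterator[str]:
    """Yield each base character together with the marks that follow it."""
    buf = ""
    for ch in s:
        # Assumption: any mark (category Mn, Mc or Me) belongs to the
        # character before it. A mark at the very start has nothing to
        # attach to, so it opens its own group.
        if buf and unicodedata.category(ch).startswith("M"):
            buf += ch
        else:
            if buf:
                yield buf
            buf = ch
    if buf:
        yield buf


# U+0489 is COMBINING CYRILLIC MILLIONS SIGN, U+0327 is COMBINING CEDILLA.
print(list(clusters("l\u0489d\u0327!")))
```

This keeps the enclosing mark attached to its "l". It is still not full grapheme cluster segmentation as a text editor would do it (emoji sequences joined with ZERO WIDTH JOINER, for example, are split). If a dependency is acceptable, the third-party regex module supports the \X pattern for extended grapheme clusters.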
