USB_S0lderer

Reputation: 170

Index strings by letter including diacritics

I'm not sure how to formulate this question, but I'm looking for a magic function that makes this code

for x in magicfunc("H̶e̕l̛l͠o͟ ̨w̡o̷r̀l҉ḑ!͜"):
    print(x)

behave like this:

H̶
e̕
l̛
l͠
o͟
 ̨
w̡
o̷
r̀
l҉
ḑ
!͜

Basically, is there a built-in Unicode function or method that takes a string and returns one item per glyph, each with its Unicode decorators, diacritical marks and so on? The same way a text editor moves the cursor to the next letter instead of stepping through every combining character.

If not, I'll write the function myself, no help needed. Just wondering if it already exists.

Upvotes: 2

Views: 76

Answers (2)

Mark Tolonen

Reputation: 177901

The third-party regex module can match one glyph (grapheme cluster) at a time with \X:

>>> import regex
>>> s="H̶e̕l̛l͠o͟ ̨w̡o̷r̀l҉ḑ!͜"
>>> for x in regex.findall(r'\X',s):
...  print(x)
...
H̶
e̕
l̛
l͠
o͟
 ̨
w̡
o̷
r̀
l҉
ḑ
!͜

Upvotes: 1

njzk2

Reputation: 39406

You can use unicodedata.combining to find out if a character is combining:

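import unicodedata
from typing import Iterable

def combine(s: str) -> Iterable[str]:
  buf = ""
  for x in s:
    if unicodedata.combining(x) != 0:
      # combining character: attach it to the current base character
      buf += x
    else:
      # new base character: emit the previous cluster, if any
      if buf:
        yield buf
      buf = x
  if buf:
    yield buf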

Result:

>>> for x in combine("H̶e̕l̛l͠o͟ ̨w̡o̷r̀l҉ḑ!͜"):
...     print(x)
... 
H̶
e̕
l̛
l͠
o͟
 ̨
w̡
o̷
r̀
l

ḑ
!͜

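The issue is that COMBINING CYRILLIC MILLIONS SIGN (U+0489) is not recognized as combining: unicodedata.combining returns the canonical combining class, and that is 0 for enclosing marks like this one. You could also test whether "COMBINING" is in unicodedata.name(x, "") for the character, which should solve it for this string. The "" default avoids a ValueError for characters that have no name.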
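Another option, instead of matching on the character name, is to test the general category. This is only a minimal sketch: split_marks is a made-up name, and it assumes that treating every character whose category starts with "M" (Mn, Mc, Me) as part of the preceding character is good enough for your input. It is not full grapheme cluster segmentation, so things like emoji ZWJ sequences are not handled.

```python
import unicodedata
from typing import Iterator


def split_marks(s: str) -> Iterator[str]:
    """Yield each base character together with the marks that follow it."""
    buf = ""
    for ch in s:
        # Mn (nonspacing), Mc (spacing) and Me (enclosing) all start with "M".
        # U+0489 is Me, so it is caught here even though
        # unicodedata.combining() returns 0 for it.
        if unicodedata.category(ch).startswith("M"):
            buf += ch
        else:
            # New base character: emit the previous cluster, if any.
            if buf:
                yield buf
            buf = ch
    if buf:
        yield buf
```

With this, list(split_marks("l\u0489d\u0327!")) keeps the millions sign attached to the l. A string that starts with a mark yields that mark as its own item.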

Upvotes: 1
