MShakeG

Reputation: 613

How to convert an Arabic character to its base glyph form in Python 3?

A single Arabic character can take on multiple glyph forms, and each form has its own Unicode/UTF-8 encoding. For example, for Aleph: Isolated == ا with utf-8==\xD8\xA7, Final == ـا with utf-8==\xD9\x80\xD8\xA7, Hamza == أ / إ with utf-8==\xD8\xA3 / \xD8\xA5, Maddah == آ with utf-8==\xD8\xA2, Maqsurah == ى with utf-8==\xD9\x89. The base form would be the isolated Aleph with utf-8==\xD8\xA7.

How can I convert an Arabic character to its base glyph form in Python 3?

Upvotes: 0

Views: 594

Answers (1)

Mark Tolonen

Reputation: 177610

You can use unicodedata.normalize to convert code points to their decomposed form, which consists of a base character followed by one or more combining modifiers. It doesn't work for all cases (particularly Maqsurah, which has no decomposition), but it could help you write a function to determine some base forms:

>>> s = 'ـا' # a tatweel followed by the base code point; nothing to decompose here
>>> import unicodedata as ud
>>> for c in s:
...     print(f'{c} U+{ord(c):04X} {ud.name(c)}')
...     
ـ U+0640 ARABIC TATWEEL
ا U+0627 ARABIC LETTER ALEF

>>> s = 'أإآ' # These characters have decomposed forms
>>> for c in s:
...     print(f'{c} U+{ord(c):04X} {ud.name(c)}')
...     
أ U+0623 ARABIC LETTER ALEF WITH HAMZA ABOVE
إ U+0625 ARABIC LETTER ALEF WITH HAMZA BELOW
آ U+0622 ARABIC LETTER ALEF WITH MADDA ABOVE
>>> s = ud.normalize('NFD',s)
>>> for c in s:
...     print(f'{c} U+{ord(c):04X} {ud.name(c)}')
...     
ا U+0627 ARABIC LETTER ALEF
ٔ  U+0654 ARABIC HAMZA ABOVE
ا U+0627 ARABIC LETTER ALEF
ٕ  U+0655 ARABIC HAMZA BELOW
ا U+0627 ARABIC LETTER ALEF
ٓ  U+0653 ARABIC MADDAH ABOVE
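
Building on the decomposition above, here is a minimal sketch of such a function. It is only one possible approach: the names to_base_form, TATWEEL and EXTRA_FOLDS are made up for this example, and the decision to fold Maqsurah (U+0649) into Alef is an assumption taken from the question, not something Unicode normalization does for you. Note that dropping every nonspacing mark also removes vowel marks (harakat), which may or may not be what you want.

```python
import unicodedata as ud

TATWEEL = '\u0640'  # ARABIC TATWEEL, an elongation stroke with no letter value

# Assumption: letters with no Unicode decomposition that we still want to
# treat as a variant of another letter. Extend or empty this as needed.
EXTRA_FOLDS = {'\u0649': '\u0627'}  # ALEF MAKSURA -> ALEF

def to_base_form(text, form='NFD'):
    """Return text with combining marks and tatweel removed.

    form is passed to unicodedata.normalize; 'NFKD' additionally
    decomposes compatibility characters such as the Arabic
    presentation forms (e.g. U+FE8E ALEF FINAL FORM).
    """
    out = []
    for ch in ud.normalize(form, text):
        # 'Mn' is the category of nonspacing marks such as U+0653..U+0655
        if ch == TATWEEL or ud.category(ch) == 'Mn':
            continue
        out.append(EXTRA_FOLDS.get(ch, ch))
    return ''.join(out)

# The three hamza/maddah alefs all reduce to U+0627
print([f'U+{ord(c):04X}' for c in to_base_form('\u0623\u0625\u0622')])
# Presentation forms only reduce when compatibility decomposition is used
print([f'U+{ord(c):04X}' for c in to_base_form('\uFE8E', form='NFKD')])
```

The default stays at NFD to match the session above; NFKD is left as an opt-in because compatibility decomposition also rewrites characters outside Arabic (ligatures, full-width forms and so on).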

Upvotes: 2
