Reputation: 613
A single Arabic character can take on multiple glyph forms, and each form has its own Unicode/UTF-8 encoding. For example, for Aleph: Isolated == ا with utf-8==\xD8\xA7, Final == ـا with utf-8==\xD9\x80\xD8\xA7, Hamza == أ / إ with utf-8==\xD8\xA3 / \xD8\xA5, Maddah == آ with utf-8==\xD8\xA2, Maqsurah == ى with utf-8==\xD9\x89, where the base form would be the isolated aleph with utf-8==\xD8\xA7.
How can I convert an Arabic character to its base glyph form in Python 3?
Upvotes: 0
Views: 594
Reputation: 177610
You can use unicodedata.normalize to convert code points to their decomposed form, consisting of a base character followed by a combining mark. It doesn't work in all cases (Maqsurah in particular has no decomposition), but it could help you write a function to determine some base forms:
>>> import unicodedata as ud
>>> s = 'ـا'  # a tatweel followed by the base alef, so there is nothing to decompose
>>> for c in s:
...     print(f'{c} U+{ord(c):04X} {ud.name(c)}')
...
ـ U+0640 ARABIC TATWEEL
ا U+0627 ARABIC LETTER ALEF
>>> s = 'أإآ' # These characters have decomposed forms
>>> for c in s:
...     print(f'{c} U+{ord(c):04X} {ud.name(c)}')
...
أ U+0623 ARABIC LETTER ALEF WITH HAMZA ABOVE
إ U+0625 ARABIC LETTER ALEF WITH HAMZA BELOW
آ U+0622 ARABIC LETTER ALEF WITH MADDA ABOVE
>>> s = ud.normalize('NFD', s)
>>> for c in s:
...     print(f'{c} U+{ord(c):04X} {ud.name(c)}')
...
ا U+0627 ARABIC LETTER ALEF
ٔ U+0654 ARABIC HAMZA ABOVE
ا U+0627 ARABIC LETTER ALEF
ٕ U+0655 ARABIC HAMZA BELOW
ا U+0627 ARABIC LETTER ALEF
ٓ U+0653 ARABIC MADDAH ABOVE
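Building on the session above, here is a minimal sketch of such a function. The name base_form, the EXTRA_BASE table, and the form parameter are my own illustrative choices, not part of unicodedata. It assumes you want to drop every nonspacing mark (which also strips vowel marks such as fatha) and the tatweel, and that you want Maqsurah mapped to a plain alef as the question describes:

```python
import unicodedata as ud

TATWEEL = '\u0640'

# Hypothetical hand-written table for letters that have no Unicode
# decomposition. Mapping ALEF MAKSURA to ALEF follows the question's
# notion of "base form" and is an assumption, not a Unicode rule.
EXTRA_BASE = {'\u0649': '\u0627'}

def base_form(text, form='NFD'):
    """Reduce Arabic text to base letters (illustrative sketch)."""
    out = []
    for ch in ud.normalize(form, text):
        # Skip the tatweel and combining marks such as hamza/maddah above.
        if ch == TATWEEL or ud.category(ch) == 'Mn':
            continue
        out.append(EXTRA_BASE.get(ch, ch))
    return ''.join(out)

print(base_form('\u0623\u0625\u0622'))  # alef with hamza above/below, with madda
print(base_form('\u0640\u0627'))        # tatweel + alef
print(base_form('\u0649'))              # alef maksura
```

Each call should print only plain alef characters (U+0627). If your input contains the legacy presentation-form code points (for example U+FE8E ARABIC LETTER ALEF FINAL FORM), passing form='NFKD' should fold those to the base letter as well, because they carry compatibility decompositions.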
Upvotes: 2