Abdulaziz Al Jumaia
Abdulaziz Al Jumaia

Reputation: 436

How to make a dictionary that contains an Arabic diacritic as a key in python

I am trying to make a program that converts the Arabic diacritics and letters into the Latin script. The letters work well in the program, but the diacritics can not be converted as I get an error every time I run the program.

At the beginning, I put the diacritics alone as keys but that did not work with me. please, see the last key, it contains َ ,which is a diacritic, but do not work properly as the letters:

def convert(lit):
    ArEn = {'ا':'A', 'ل':'L', "و": "W", "َ":"a"}
    end_word=[]
    for i in range(len(lit)):
        end_word.append(ArEn[lit[i]])
        jon = ""

    print(jon.join(end_word))

convert("الوَ")

However, I tried to fix the problem by using letters attached with diacritics as keys, but the program resulted in the same error:

the dictionary:

ArEn = {'ا':'A', 'ل':'L', "وَ":"Wa"}

the error:

    Traceback (most recent call last):
  File "C:\Users\Abdulaziz\Desktop\converter AR to EN SC.py", line 10, in <module>
    convert("الوَ")
  File "C:\Users\Abdulaziz\Desktop\converter AR to EN SC.py", line 5, in convert
    end_word.append(ArEn[lit[i]])
KeyError: 'و'

Upvotes: 1

Views: 1923

Answers (3)

Abdulaziz Al Jumaia
Abdulaziz Al Jumaia

Reputation: 436

Update: I just noticed, after years, that the letters and diacritics are put together in the first try. When I separated them, the program worked.

I just solved the problem! I am not really sure if it is a mistake in python or something else, but as far as I know python does not support Arabic very well. Or maybe I made a problem in the program above.

I kept writing the same program and suddenly it worked very well. I even added different diacritics and they worked properly.

    def convert(lit):
    ArEn = {'ا':'A', 'ل':'L', "و":"W", "َ":"a", "ُ":"w", "":""}
    end_word=[]
    for i in range(len(lit)):
        end_word.append(ArEn[lit[i]])
        jon = ""

    print(jon.join(end_word))

convert("اُلوَ")

the reult is

AwLWa

Upvotes: 0

Nasser Al-Wohaibi
Nasser Al-Wohaibi

Reputation: 4661

You might need proper indentation in python:

def convert(lit):
    ArEn = {'ا':'A', 'ل':'L', "و":"W", "َ":"a", "ُ":"w", "":""}
    end_word=[]
    for i in range(len(lit)):
        end_word.append(ArEn[lit[i]])
        jon = ""

    print(jon.join(end_word))

convert("اُلوَ")

Upvotes: 0

jsbueno
jsbueno

Reputation: 110271

The chances are rather there is a bug in the programing-code editor you are using for coding Python than on Pyhton itself. Since you are using Python-3.x, the diacritics from the running progam point of view are just a single character, like any other, and there should be no issues at all.

From the cod-editor point of view, there are issues such as whether to advance one character when displaying certain special unicode characters or not, and maybe the " character itself can be show out of space - when one tries to manually correct the position of the ", one could place it out of order, leaving the special character actually outside the quoted string -

The fact you could solve the issue by re-editing the file suggests that is indeed what happened.

One way to avoid this is to put certain special characters - specially ones that have different displaying rules, is to escape then with the "\uxxxx" unicode codepoint unicode sequence. This will avoid yourself or other persons having issues when editing your file again in the future, since even i yu get it working now, the editor may show then incorrectly when they are opened, and by trying to fix it one might break the syntax again.

You can use a table on the web or Python3's interactive prompt to get the unicode codepoint of each character, ensuring the code part of the program is displayed in a deterministic way in any editor - (if you add the diacritical char as a comment on the same line, it will actually enhance the readability of your code - enormously if it is ever supposed to be edited by non Arabic speakers)

So, your above declaration, I used this snippet to extract the codepoints:

>>> ArEn = {'ا':'A', 'ل':'L', "و": "W", "َ":"a"}
>>> [print (hex(ord(yy)), yy ) for yy in ArEn.keys()]

0x648 و
0x644 ل
0x64e َ
0x627 ا

Which allows me to declare the dictionary like this:

ArEn = {
 "\u0648": "W",    # و
 "\u0644": "L",    # L
 "\u064e": "a",    #  ۮ
 "\u0627": "A",   # ا
}

(And yes, I had trouble with displaying the characters on my terminal like I said you probably had on your editor while getting these - the fatha ("\u064e" - "a") character is tricky ! :-) )

Alternatively for using the codepoints in your code, is to use Python's unicode data module to discover and them use the actual character names - this can enhance readability further, and maybe by exploring unicodedata you can find out you don't even have to create this dictionary manually, but use that module instead -

In [16]: [print("\\u{:04x} - '{}' - {}".format(ord(yy), unicodedata.name(yy),  yy) ) for yy in ArEn.keys()]
\u0648 - 'ARABIC LETTER WAW' - و
\u0644 - 'ARABIC LETTER LAM' - ل
\u064e - 'ARABIC FATHA' - َ
\u0627 - 'ARABIC LETTER ALEF' - ا

And from these full text names, you can get back to the character with the unicodedata.lookup function:

>>> unicodedata.lookup("ARABIC LETTER LAM")
 'ل'

notes: 1) This requires Python3 - for Python2 one might try to prefix each string with u"" - but one dealign with these characters is far better off using Python 3, since unicode support is one of the big deals with it. 2) This also requires a terminal with a nice support for unicode characters using "utf-8" encoding - I am on a Linux system with the "konsole" terminal. On Windows, the idle Python prompt might work, but not the cmd Python prompt.

Upvotes: 1

Related Questions