Reputation: 9
I am performing named-entity recognition (NER) on Arabic text. The code is as follows:
from polyglot.text import Text
blob = "مرحبا اسمي rahul agnihotri أنا عمري 41 سنة و الهندية"
text = Text(blob)
text = Text(blob, hint_language_code='ar') #ar stands for arabic
print(text.entities)
When I run this code on Ubuntu, I get the following error:
SyntaxError: Non-ASCII character '\xd9' in file ./ner.py on line 4, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
However, if I include # -*- coding: utf-8 -*- at the top of the file, it works, and here is the output:
[I-LOC([u'\u0627\u0644\u0647\u0646\u062f\u064a\u0629'])]
This is not the output I am looking for. The output should be displayed in Arabic, not as escape sequences.
FYI: All required libraries are installed.
Upvotes: 0
Views: 1292
Reputation: 11
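In Python, you can recover the Arabic text by encoding the string to UTF-8 and decoding those bytes again:
s = u"\u0627\u0644\u0647\u0646\u062f\u064a\u0629"  # u prefix: Unicode literal, so Python 2 interprets the \u escapes
s = s.encode('utf-8', 'strict')  # UTF-8 encoded bytes
print(s.decode('utf-8'))  # decode back to text before printing
The output will be: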
الهندية
I hope this is what you are looking for.
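To make the idea concrete, here is a minimal sketch of why the list shows \u escapes while a single string prints as Arabic. The entity_words list is hypothetical stand-in data, not real polyglot output, and entity_to_text is a helper name made up for this illustration. It assumes each entity behaves like a plain list of Unicode words, which I have not checked against polyglot itself.

```python
# Hypothetical stand-in for the words of one entity (not real polyglot output).
entity_words = [u"\u0627\u0644\u0647\u0646\u062f\u064a\u0629"]


def entity_to_text(words):
    """Join the words of one entity into a single Unicode string."""
    return u" ".join(words)


text = entity_to_text(entity_words)

# The escaped form is only a display form of the same seven characters.
escaped = text.encode("unicode_escape").decode("ascii")

print(escaped)  # the \uXXXX form, similar to what the list display shows
print(text)     # the Arabic word itself, assuming a UTF-8 capable terminal
```

Printing each entity's text in a loop, rather than printing the whole list, should avoid the escaped display in Python 2.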
Upvotes: 0
Reputation: 109
UTF-8 encoded text must be decoded before it is displayed. What you are seeing when you print is the escaped representation of the text, not the Arabic characters themselves. I am not familiar with polyglot and cannot confirm this, but please try the following.
If you want to eliminate the dependency on the file encoding, then after you set blob, use blob.encode('utf-8'), and later decode the UTF-8 for printing with print(text.entities.decode('utf-8')).
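As a rough sketch of the encode and decode round trip described above, using only built-in string methods: decode_all is a hypothetical helper introduced here for illustration, and nothing in it is polyglot-specific. Note that decode applies to individual byte strings, so a list has to be decoded item by item.

```python
word = u"\u0627\u0644\u0647\u0646\u062f\u064a\u0629"  # Unicode text

raw = word.encode("utf-8")      # UTF-8 bytes; each of these letters takes 2 bytes
restored = raw.decode("utf-8")  # back to the original Unicode text


def decode_all(items, encoding="utf-8"):
    """Decode byte strings in a list; leave Unicode strings untouched."""
    return [item.decode(encoding) if isinstance(item, bytes) else item
            for item in items]


mixed = [raw, word]             # one encoded item, one already decoded
decoded = decode_all(mixed)
```

After this, every element of decoded is Unicode text and can be printed individually.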
Upvotes: 1