rahulagnihotri

Reputation: 9

Python and Named Entity Recognition for Arabic Language

I am performing NER on Arabic text. The code is as follows:

from polyglot.text import Text
blob = "مرحبا اسمي rahul agnihotri أنا عمري 41 سنة و الهندية"
text = Text(blob)
text = Text(blob, hint_language_code='ar') #ar stands for arabic
print(text.entities)

After running the above code on Ubuntu, I get the following error:

SyntaxError: Non-ASCII character '\xd9' in file ./ner.py on line 4, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

However, if I include # -*- coding: utf-8 -*- at the top of the file, it works, and here is the output:

[I-LOC([u'\u0627\u0644\u0647\u0646\u062f\u064a\u0629'])]

This is not the output I am looking for. The desired output should be in Arabic script, not in escaped form like this.

FYI: All required libraries are installed.

Upvotes: 0

Views: 1292

Answers (2)

alhnoof mohamd

Reputation: 11

In Python, you can get the Arabic text back by encoding the string to UTF-8 bytes and then decoding those bytes:

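# The u prefix makes this a unicode literal on Python 2 (it is also accepted on Python 3.3+)
s = u"\u0627\u0644\u0647\u0646\u062f\u064a\u0629"
encoded = s.encode('utf-8', 'strict')  # unicode text -> UTF-8 bytes

print(encoded.decode('utf-8'))  # UTF-8 bytes -> unicode text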

The output will be:

الهندية

I hope this is what you are looking for.
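For context, the \uXXXX sequences in the question's output and the Arabic word are the same text; the escapes are only an ASCII-safe way of displaying it. A minimal sketch of that equivalence, assuming Python 3 and using only the standard unicode_escape codec (the variable names are illustrative):

```python
# The word as Python stores it: seven Arabic code points.
word = "\u0627\u0644\u0647\u0646\u062f\u064a\u0629"

# ASCII-only escaped view of the same text, similar to what repr() shows on Python 2.
escaped = word.encode("unicode_escape").decode("ascii")

# Convert the escaped view back into the original text.
restored = escaped.encode("ascii").decode("unicode_escape")

print(escaped)           # the backslash-u escapes as plain ASCII
print(restored == word)  # the round trip gives back the same string
```

Printing restored directly should show the Arabic script, provided the terminal encoding supports it.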

Upvotes: 0

SteamyThePunk

Reputation: 109

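UTF-8 encoded text must be decoded before it is displayed. What you're seeing when you print is the escaped form of the characters rather than the characters themselves, so it needs to be converted back to text. I am not familiar with polyglot and cannot confirm this, but please try the following.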

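If you want to eliminate the dependency on the file encoding, then after you set blob, use blob.encode('utf-8'), and later, to decode the UTF-8 for printing, use print(text.entities.decode('utf-8')). Note that text.entities prints as a list, so this call will probably need to be applied to the individual words rather than to the list itself.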
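The output in the question is a list, and printing a list on Python 2 shows the repr of each element, which is where the \u escapes come from. A minimal sketch of building printable strings from the words instead, assuming each entity can be iterated as a sequence of unicode words (as the question's output suggests). The entities value and the entity_to_text helper are hypothetical stand-ins, not polyglot's API:

```python
def entity_to_text(words):
    """Join one entity's words into a single unicode string."""
    return u" ".join(words)

# Hypothetical stand-in for text.entities: each entity is a list of unicode words.
entities = [[u"\u0627\u0644\u0647\u0646\u062f\u064a\u0629"]]

# One plain string per entity, instead of the list's repr.
lines = [entity_to_text(entity) for entity in entities]
```

Printing each element of lines on its own (for example, for line in lines: print(line)) displays the string rather than its repr, which should appear as Arabic script on a UTF-8 terminal.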

Upvotes: 1
