Amir Mohsen

Reputation: 851

Python Arabic encoding issue

I have text that was encoded as Windows-1256, and I want to convert it from Arabic (Windows-1256) to UTF-8.

Sample text:

Óæí Ïæã ÈíåÞí

Expected result:

سوي دوم بيهقي

I use this code to decode the text and encode it to UTF-8:

# -*- coding: utf-8 -*-

data = "Óæí Ïæã ÈíåÞí"
print data.decode("windows-1256", "replace")
print data.encode("windows-1256")

That code returns this result:

أ“أ¦أ­ أڈأ¦أ£ أˆأ­أ¥أ‍أ­
Traceback (most recent call last):
  File "mohmal2.py", line 5, in <module>
    print data.encode("windows-1256")
  File "/usr/lib/python2.7/encodings/cp1256.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

I found a site that can convert this text:

http://www.iosart.com

Upvotes: 9

Views: 18634

Answers (3)

Sara M

Reputation: 1

Following Josh Lee's solution, I used this line to read a CSV file containing Farsi (Persian) characters, and it worked well:

    df = pd.read_csv(r'C:\Users\FILE PATH\FILE NAME.csv', encoding='cp1256')
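
If pandas is not available, the same idea can be sketched with the standard library alone: decode the file as cp1256 while reading and encode it as UTF-8 while writing. This is only a minimal sketch for Python 3; the function name and the file paths are hypothetical and not part of the original answer.

```python
def convert_cp1256_to_utf8(src_path, dst_path):
    """Re-encode a Windows-1256 text file as UTF-8 (hypothetical helper)."""
    # Decode the bytes as Windows-1256 while reading.
    # newline="" leaves the original line endings untouched.
    with open(src_path, "r", encoding="cp1256", newline="") as src:
        text = src.read()
    # Encode the same text as UTF-8 while writing.
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)
```

Calling convert_cp1256_to_utf8("input.csv", "output.csv") would then leave a UTF-8 copy that can be read without passing an encoding argument.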

Upvotes: 0

I would like to add the Python 2 case to @josh-lee's answer.
In Python 2, add the unicode prefix u to the string literal so that it is a unicode object rather than a byte string:

>>> u"Óæí Ïæã ÈíåÞí".encode('cp1252').decode('cp1256')
u'\u0633\u0648\u064a \u062f\u0648\u0645 \u0628\u064a\u0647\u0642\u064a'
>>> print _
سوي دوم بيهقي

Upvotes: 4

Josh Lee

Reputation: 177564

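It looks like you have accidentally decoded the input as Windows-1252. Encoding the garbled text back to cp1252 recovers the original bytes, which can then be decoded as cp1256: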

>>> "Óæí Ïæã ÈíåÞí".encode('cp1252').decode('cp1256')
'سوي دوم بيهقي'
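
The round trip above can be wrapped in a small helper. This is a minimal sketch for Python 3, assuming the text really was decoded with cp1252 by mistake; the name fix_mojibake and its parameters are made up for illustration.

```python
def fix_mojibake(garbled, wrong="cp1252", right="cp1256"):
    """Undo a decode that used the wrong single-byte code page."""
    # Encoding with the wrong code page recovers the original bytes;
    # decoding those bytes with the right one yields the intended text.
    return garbled.encode(wrong).decode(right)


garbled = "Óæí Ïæã ÈíåÞí"
fixed = fix_mojibake(garbled)

# ascii() keeps the output printable on any console.
print(ascii(fixed))

# The repaired text as UTF-8 bytes, ready to be written to a file.
utf8_bytes = fixed.encode("utf-8")
```

If the garbled text contains a character that cp1252 cannot represent, the encode step raises UnicodeEncodeError, which is a hint that a different code page was involved in the wrong decode.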

Upvotes: 12
