Coddy
Coddy

Reputation: 891

Reading arabic text encoded in utf-8 in python

I am using Python 2.7. I have got the following line (string) from a text file encoded in utf-8:

"تازہ ترین خبروں، بریکنگ نیوز، ویڈیو، آڈیو، فیچر اور تجزیوں کے لیے بی بی سی اردو"

I am using the following code to print it on the screen:

import codecs
filename = codecs.open('file path', 'r', encoding="utf-8")
outputfile = filename.readlines()
print outputfile

It gives the following output:

[u'\ufeff\u062a\u0627\u0632\u06c1 \u062a\u0631\u06cc\u0646 \u062e\u0628\u0631\u0648\u06ba\u060c \u0628\u0631\u06cc\u06a9\u0646\u06af \u0646\u06cc\u0648\u0632\u060c \u0648\u06cc\u0688\u06cc\u0648\u060c \u0622\u0688\u06cc\u0648\u060c \u0641\u06cc\u0686\u0631 \u0627\u0648\u0631 \u062a\u062c\u0632\u06cc\u0648\u06ba \u06a9\u06d2 \u0644\u06cc\u06d2 \u0628\u06cc \u0628\u06cc \u0633\u06cc \u0627\u0631\u062f\u0648 \u06a9\u06cc \u0648\u06cc\u0628']

The purpose is to print the text correctly, and not how to print each line. So, how can I print the string or content of text file correctly in its original form? like:

تازہ ترین خبروں، بریکنگ نیوز، ویڈیو، آڈیو، فیچر اور تجزیوں کے لیے بی بی سی اردو     

Upvotes: 5

Views: 7227

Answers (2)

Mark Ransom
Mark Ransom

Reputation: 308140

readlines() returns a list. When you print a list, it prints the repr() of each item in the list. The repr of a string is encoded the way you see here to make sure it's not dependent on system encoding. You want to print the string directly:

print outputfile[0]

Upvotes: 1

aIKid
aIKid

Reputation: 28252

What you see is just the representation of the string. Since you're printing the list, the one shown is the representation, not the readable form.

You can print it normally, for each lines:

for line in outputfile:
    print(line)

Demo:

>>> s = u'\ufeff\u062a\u0627\u0632\u06c1 \u062a\u0631\u06cc\u0646 \u062e\u0628\u0631\u0648\u06ba\u060c \u0628\u0631\u06cc\u06a9\u0646\u06af \u0646\u06cc\u0648\u0632\u060c \u0648\u06cc\u0688\u06cc\u0648\u060c \u0622\u0688\u06cc\u0648\u060c \u0641\u06cc\u0686\u0631 \u0627\u0648\u0631 \u062a\u062c\u0632\u06cc\u0648\u06ba \u06a9\u06d2 \u0644\u06cc\u06d2 \u0628\u06cc \u0628\u06cc \u0633\u06cc \u0627\u0631\u062f\u0648 \u06a9\u06cc \u0648\u06cc\u0628'

>>> print(s)
تازہ ترین خبروں، بریکنگ نیوز، ویڈیو، آڈیو، فیچر اور تجزیوں کے لیے بی بی سی اردو کی ویب

Upvotes: 4

Related Questions