royi

Reputation: 31

Why don't I see the Hebrew characters when I print text from a UTF-8 file in Python?

I'm trying to read Hebrew from a text file:

def task1():
    f = open('C:\\Users\\royi\\Desktop\\final project\\corpus-haaretz.txt', 'r',"utf-8")
    print 'success'
    return f

a = task1()

When I read it, it shows me this:

'[\xee\xe0\xee\xf8 \xee\xf2\xf8\xeb\xfa \xf9\xec \xe4\xf0\xe9\xe5-\xe9\xe5\xf8\xf7 \xe8\xe9\xe9\xee\xf1: \xf2\xec \xe1\xe9\xfa \xe4\xee\xf9\xf4\xe8 \xec\xe1\xe8\xec \xe0\xfa \xe7\xe5\xf7 \xe4\xe7\xf8\xed, \xec\xe8\xe5\xe1\xfa \xe9\xf9\xf8\xe0\xec \xee\xe0\xfa \xf0\xe9\xe5 

and many more.

How do I read it?

Upvotes: 3

Views: 7900

Answers (4)

John Machin

Reputation: 82934

Your description of how you read the file appears to be incorrect. It is puzzling that "it" manages to show you bytes that are obviously Hebrew text encoded in cp1255.

We need to be shown unambiguously what is in the first few (say 200) bytes of your file. Please run one of the following commands in a Command Prompt window, depending on what Python you are using:

Python 2.x (assuming 2.7 installed in the standard place):

prompt>c:\python27\python -c "import locale; print locale.getpreferredencoding(), repr(open('your_file.txt', 'rb').read(200))"

or Python 3.x

prompt>c:\python32\python -c "import locale; print(locale.getpreferredencoding(),ascii(open('your_file.txt', 'rb').read(200)))"

Edit your question and (1) copy/paste the output from the command (2) tell us what version of Python you are using.

Upvotes: 0

Bite code

Reputation: 596793

You print it like this:

print task1().read().encode('your terminal encoding here')

You must be sure that your terminal is able to display Hebrew characters. For example, on a fully UTF-8 Linux distribution with Hebrew locales installed:

print task1().read().encode('utf-8')

Be careful with open:

  • With Python 2.7, the built-in open has no encoding parameter. Use the codecs module.
  • With Python 3+, the encoding parameter is the fourth one, not the third as in your code. You probably mean something like open(path, 'r', encoding='utf-8'). You can even omit 'r'.
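As a rough sketch of the point above: io.open accepts an encoding keyword on both Python 2.7 and Python 3 (on Python 3 it is the same function as the built-in open). The sample string and the temporary file are made up for illustration and stand in for the real corpus file.

```python
import io
import os
import tempfile

# Hypothetical sample text (the Hebrew word "shalom"); any Hebrew string would do.
sample = u'\u05e9\u05dc\u05d5\u05dd'

# Create a small UTF-8 file to read back; it stands in for the real corpus file.
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(sample)

# Pass the encoding by keyword so it cannot be mistaken for the buffering argument.
with io.open(path, 'r', encoding='utf-8') as f:
    content = f.read()

os.remove(path)
```

Here content is a text (unicode) object, not raw bytes, so it still has to be encoded when it is sent to a terminal that expects bytes.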

So why would you use encode?

Well, when you read a file and tell Python the encoding, it returns a unicode object, not a str object. For example, on my system:

>>> import codecs
>>> content = codecs.open('/etc/fstab', encoding='utf-8').read()
>>> type(content)
<type 'unicode'>
>>> type('')
<type 'str'>
>>> type(u'')
<type 'unicode'>

You need to encode it back to a str if you want to print it and it contains non-ASCII characters:

>>> type(content.encode('utf-8'))
<type 'str'>

We use encode because we are starting from a more or less generic text object (unicode is as generic as you can get for text manipulation) and turning it into (encoding it as) a specific representation (utf-8).

And we need this specific representation because your system doesn't know about Python's internals and can only print ASCII characters if you don't specify the encoding. So when you output, you encode specifically to an encoding your system can understand. For me, luckily, that is 'utf-8', so it's easy. If you are on Windows, it can get tricky.
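To make the "generic text versus specific representation" idea concrete, here is a small sketch. The sample word is made up, and falling back to 'utf-8' when the terminal reports no encoding is my own assumption, not something Python does for you.

```python
import sys

# Hypothetical sample text (the Hebrew word "shalom").
text = u'\u05e9\u05dc\u05d5\u05dd'

# The same text has a different byte representation in each encoding.
utf8_bytes = text.encode('utf-8')      # two bytes per Hebrew letter
cp1255_bytes = text.encode('cp1255')   # one byte per Hebrew letter

# sys.stdout.encoding is what the terminal claims to accept; it can be None
# when output is redirected, so fall back to 'utf-8' (an assumption).
terminal_encoding = getattr(sys.stdout, 'encoding', None) or 'utf-8'

# errors='replace' substitutes '?' for characters the terminal cannot represent
# instead of raising UnicodeEncodeError.
safe_bytes = text.encode(terminal_encoding, 'replace')
```

On a Windows console that is not set to a Hebrew code page, safe_bytes will probably contain question marks, which is one way the "tricky" case shows up.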

Upvotes: 5

mac

Reputation: 43041

From the look of it, it seems to me that the encoding of the string you get is 'windows-1255', not 'utf-8'. Try to open the file using that encoding instead.
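A quick way to check this guess, taking the first two words of the output in the question as the sample bytes (nothing else here comes from the original file):

```python
# First two words of the byte string shown in the question.
raw = b'\xee\xe0\xee\xf8 \xee\xf2\xf8\xeb\xfa'

# These bytes are not valid UTF-8, so decoding them as UTF-8 fails.
try:
    raw.decode('utf-8')
    is_utf8 = True
except UnicodeDecodeError:
    is_utf8 = False

# Decoding as cp1255 (Python's name for windows-1255) yields Hebrew letters.
text = raw.decode('cp1255')
```

If is_utf8 comes out False and text reads as sensible Hebrew, the file is very likely windows-1255 rather than UTF-8.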

Upvotes: 1

supakeen

Reputation: 2914

You need to use the codecs module to open the file. The open() call (see docs) doesn't take a third argument like that; in Python 2 the third argument is the buffer size.

Specifically codecs.open(). Always decode when you read, encode when you output :-)
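A minimal sketch of "decode on read, encode on output", assuming the input turns out to be cp1255. The transcode helper, the file names, and the sample bytes are all made up for illustration; io.open is used because it behaves the same on Python 2.7 and 3, and codecs.open would work similarly.

```python
import io
import os
import tempfile


def transcode(src_path, dst_path, src_encoding, dst_encoding='utf-8'):
    """Hypothetical helper: decode a file on read, re-encode it on write."""
    with io.open(src_path, 'r', encoding=src_encoding) as src:
        text = src.read()          # decoded: a text (unicode) object
    with io.open(dst_path, 'w', encoding=dst_encoding) as dst:
        dst.write(text)            # encoded again as it is written out
    return text


# Demo with a throwaway file holding the word "shalom" encoded as cp1255.
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, 'in.txt')
dst_path = os.path.join(tmp_dir, 'out.txt')
with open(src_path, 'wb') as f:
    f.write(b'\xf9\xec\xe5\xed')

decoded = transcode(src_path, dst_path, 'cp1255')

with open(dst_path, 'rb') as f:
    out_bytes = f.read()
```

Inside the program everything stays as text; bytes only appear at the two file boundaries.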

Upvotes: 1
