Reputation: 31
I'm trying to read Hebrew from a text file:
def task1():
    f = open('C:\\Users\\royi\\Desktop\\final project\\corpus-haaretz.txt', 'r',"utf-8")
    print 'success'
    return f
a = task1()
When I read it, it shows me this:
'[\xee\xe0\xee\xf8 \xee\xf2\xf8\xeb\xfa \xf9\xec \xe4\xf0\xe9\xe5-\xe9\xe5\xf8\xf7 \xe8\xe9\xe9\xee\xf1: \xf2\xec \xe1\xe9\xfa \xe4\xee\xf9\xf4\xe8 \xec\xe1\xe8\xec \xe0\xfa \xe7\xe5\xf7 \xe4\xe7\xf8\xed, \xec\xe8\xe5\xe1\xfa \xe9\xf9\xf8\xe0\xec \xee\xe0\xfa \xf0\xe9\xe5
and much more.
How do I read it?
Upvotes: 3
Views: 7900
Reputation: 82934
Your description of how you read the file appears to be incorrect. It is puzzling that "it" manages to show you bytes that are obviously Hebrew text encoded in cp1255.
We need to be shown unambiguously what is in the first few (say 200) bytes of your file. Please run one of the following commands in a Command Prompt window, depending on what Python you are using:
Python 2.x (assuming 2.7 installed in the standard place):
prompt>c:\python27\python -c "import locale; print locale.getpreferredencoding(), repr(open('your_file.txt', 'rb').read(200))"
or Python 3.x
prompt>c:\python32\python -c "import locale; print(locale.getpreferredencoding(),ascii(open('your_file.txt', 'rb').read(200)))"
Edit your question and (1) copy/paste the output from the command (2) tell us what version of Python you are using.
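The same check can also be run as a script. What follows is only a minimal sketch: the helper name probe_encodings and the list of candidate encodings are illustrative assumptions, and the first bytes quoted in the question stand in for the real file. It reads the raw bytes in binary mode and reports which candidates decode them without error.

```python
import io
import locale
import os
import tempfile


def probe_encodings(path, candidates=("utf-8", "cp1255"), size=200):
    """Read the first `size` raw bytes and report which candidates decode them."""
    with io.open(path, "rb") as f:  # binary mode: nothing is decoded here
        head = f.read(size)
    results = {}
    for name in candidates:
        try:
            head.decode(name)
            results[name] = True
        except UnicodeDecodeError:
            results[name] = False
    return head, results


# Stand-in for the real file: the first bytes shown in the question.
sample = b"[\xee\xe0\xee\xf8 \xee\xf2\xf8\xeb\xfa"
fd, sample_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(sample)

head, results = probe_encodings(sample_path)
os.remove(sample_path)

print(locale.getpreferredencoding())
print(repr(head))
print(results)
```

Note that cutting at a fixed size can split a multi-byte UTF-8 character, so a UTF-8 failure at the very end of the chunk is not conclusive on its own.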
Upvotes: 0
Reputation: 596793
You print it like this:
print task1().read().encode('your terminal encoding here')
You must be sure that your terminal is able to display Hebrew characters. For example, on a fully UTF-8 Linux distribution with Hebrew locales installed:
print task1().read().encode('utf-8')
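Careful with open: in Python 2 the built-in open does not take an encoding (its third argument is the buffer size), so use the one from the codecs module, codecs.open(path, 'r', encoding='utf-8'). You can even omit 'r'.
So why would you use encode?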
Well, when you read a file and tell Python the encoding, it returns a unicode object, not a str object. For example, on my system:
>>> import codecs
>>> content = codecs.open('/etc/fstab', encoding='utf-8').read()
>>> type(content)
<type 'unicode'>
>>> type('')
<type 'str'>
>>> type(u'')
<type 'unicode'>
If it contains non-ASCII characters, you need to encode it back to a str to get something printable:
>>> type(content.encode('utf-8'))
<type 'str'>
We use encode because here we are talking about a more or less generic text object (unicode is as generic as you can get with text manipulation), and you turn it (encode) into a specific representation (utf-8).
We need this specific representation because your system doesn't know about Python's internals and can only print ASCII characters unless you specify the encoding. So when you output, you encode to an encoding your system can understand. For me that is, luckily, 'utf-8', so it's easy. If you are on Windows, it can get tricky.
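To make the difference between the generic text object and its byte representations concrete, here is a small sketch that is not part of the original answer. The sample word and the variable names are assumptions chosen for illustration. It encodes the same text in two encodings and decodes it back.

```python
# The Hebrew word "shalom", written with escapes so the source stays ASCII.
text = u"\u05e9\u05dc\u05d5\u05dd"

as_utf8 = text.encode("utf-8")     # two bytes per Hebrew letter
as_cp1255 = text.encode("cp1255")  # one byte per Hebrew letter

print(repr(as_utf8))
print(repr(as_cp1255))

# Decoding with the matching encoding gives back the same text object.
round_trip_utf8 = as_utf8.decode("utf-8")
round_trip_cp1255 = as_cp1255.decode("cp1255")
print(round_trip_utf8 == text and round_trip_cp1255 == text)
```

The text is the same in both cases; only the bytes differ, which is why the encoding has to be named whenever text crosses the boundary to a file or a terminal.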
Upvotes: 5
Reputation: 43041
From the look of it, it seems to me that the encoding of the string you get is 'windows-1255', not 'utf-8'. Try to open the file using that encoding instead.
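A minimal sketch of that suggestion, assuming the file really is in that code page (Python's codec name for it is 'cp1255'). The temporary file and the variable names are stand-ins for illustration; the bytes are the first word from the output in the question. io.open is used because it accepts an encoding on both Python 2.6+ and Python 3.

```python
import io
import os
import tempfile

# Stand-in for the real corpus file: the first word from the question's output.
raw = b"\xee\xe0\xee\xf8"
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(raw)

# Decode while reading by naming the encoding the file was written in.
with io.open(path, "r", encoding="cp1255") as f:
    content = f.read()

os.remove(path)
print(repr(content))
```

If the result is readable Hebrew text rather than a decoding error or mojibake, that is good evidence the guess about the encoding was right.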
Upvotes: 1