Reputation: 197
I have a simple Python (2.7.10) program like this:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
with open("test.txt") as f:
    input = f.readlines()
for i in input:
    l = list(i)
    mystring = ""
    for j in l:
        mystring += j
        print mystring, '\n',
The text file, 'test.txt', contains this:
AAAÖÖAAA
When I run the code, however, each time 'Ö' is at the end of mystring, it is output as '?', like this:
A
AA
AAA
AAA?
AAAÖ
AAAÖ?
AAAÖÖ
AAAÖÖA
AAAÖÖAA
AAAÖÖAAA
AAAÖÖAAA
If I run the code on Python 3 instead (having to change the print statement to 'print (mystring)'), the output is correct:
A
AA
AAA
AAAÖ
AAAÖÖ
AAAÖÖA
AAAÖÖAA
AAAÖÖAAA
AAAÖÖAAA
Does anybody know why this is happening and how to fix it? I've tried googling but haven't really found anything.
Upvotes: 2
Views: 70
Reputation: 1121584
You are printing UTF-8 bytes, not characters: in Python 2, open() gives you byte strings, so your loop appends one byte at a time.
UTF-8 is a variable-width encoding; it uses anywhere between 1 and 4 bytes to encode a given Unicode codepoint. The Ö is encoded to two bytes in UTF-8, while the letter A requires only one:
>>> u'Ö'.encode('utf8')
'\xc3\x96'
>>> u'A'.encode('utf8')
'A'
Printing just the first byte (hexadecimal C3) is not valid UTF-8 output, so your terminal uses a ? to indicate it cannot decode the data you printed. On my Mac terminal, the U+FFFD REPLACEMENT CHARACTER (�) is printed in that case:
>>> print u'Ö'.encode('utf8')
Ö
>>> print u'Ö'.encode('utf8')[0]
�
>>> print u'Ö'.encode('utf8')[1]
�
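To see exactly which prefixes of the line break, here is a small sketch that checks every byte-wise prefix for valid UTF-8. The helper name is_valid_utf8 is made up for illustration, and the sample line is hard-coded instead of being read from test.txt:

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# The sample line from the question, as the raw UTF-8 bytes that
# Python 2's open() hands back (hard-coded here for illustration).
data = u"AAA\u00d6\u00d6AAA".encode("utf-8")


def is_valid_utf8(chunk):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        chunk.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Lengths of the byte-wise prefixes that end in the middle of a character.
broken = [n for n in range(1, len(data) + 1) if not is_valid_utf8(data[:n])]
print(len(data), broken)  # 10 [4, 6]
```

The two broken prefix lengths, 4 and 6, line up with the two lines in the question's output that end in '?'.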
If you first decode your data to a unicode object, you can iterate over codepoints rather than over bytes:
for i in input:
    l = list(i.decode('utf8'))
Note that you don't have to call list() on the object just to iterate. Looping over a string already gives you individual characters.
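As a rough sketch of that idea, the loop from the question can run directly over the decoded string, with no list() call. The prefixes helper and the hard-coded raw_line are assumptions for illustration; in the real script the line would come from the file:

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# Stand-in for one line read from test.txt as a byte string.
raw_line = u"AAA\u00d6\u00d6AAA".encode("utf-8")


def prefixes(text):
    """Yield each growing prefix of text, one code point at a time."""
    current = u""
    for ch in text:  # iterating the string directly, no list() needed
        current += ch
        yield current


# Decode first, then build the prefixes from code points instead of bytes.
result = list(prefixes(raw_line.decode("utf-8")))
print([len(p) for p in result])  # [1, 2, 3, 4, 5, 6, 7, 8]
```

Every prefix now ends on a whole character, so there are 8 prefixes for 8 code points even though the line is 10 bytes long.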
You could also open the file with io.open(); this gives you a file object that returns unicode objects by default when reading, provided you tell it what codec to use:
import io
with io.open("test.txt", encoding='utf8') as f:
    input = f.readlines()
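For comparison, here is a self-contained sketch that reads the same file once in text mode and once in binary mode. It writes a throwaway copy of the sample line to a temporary file, which is an assumption made only so the example runs on its own:

```python
# -*- coding: utf-8 -*-
from __future__ import print_function
import io
import os
import tempfile

# Create a throwaway file holding the sample line (illustration only).
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    with io.open(path, "w", encoding="utf8") as f:
        f.write(u"AAA\u00d6\u00d6AAA\n")

    # Text mode with an explicit codec: readlines() returns decoded text.
    with io.open(path, encoding="utf8") as f:
        lines = f.readlines()

    # Binary mode returns the undecoded bytes.
    with io.open(path, "rb") as f:
        raw = f.read()
finally:
    os.remove(path)

first = lines[0].rstrip(u"\n")
print(len(first), len(raw.rstrip()))  # 8 10
```

The decoded line has 8 characters while the raw data has 10 bytes, because each Ö takes two bytes in UTF-8.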
Upvotes: 3
Reputation: 17751
Another way to fix the problem for Python 2.
Instead of opening the file with open() ...
with open("test.txt") as f:
    input = f.readlines()
... use io.open():
import io
with io.open("test.txt", encoding='utf8') as f:
    input = f.readlines()
io.open() has the same behavior as Python 3's open() builtin.
Upvotes: 3