'Foreign' characters lost when they're at the final position in a list

Question

I have a simple Python (2.7.10) program like this:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

with open("test.txt") as f:
    input = f.readlines()

for i in input:
    l = list(i)
    mystring = ""
    for j in l:
        mystring += j
        print mystring, '\n',

The text file, 'test.txt', contains this:

AAAÖÖAAA

When I run the code however, each time 'Ö' is at the end of mystring, it's being output as '?', like this:

A 
AA 
AAA 
AAA? 
AAAÖ 
AAAÖ? 
AAAÖÖ 
AAAÖÖA 
AAAÖÖAA 
AAAÖÖAAA 
AAAÖÖAAA

If I run the code on Python 3 instead (having to change the print statement to 'print (mystring),'), the output is correct:

A
AA
AAA
AAAÖ
AAAÖÖ
AAAÖÖA
AAAÖÖAA
AAAÖÖAAA
AAAÖÖAAA

Does anybody know why this is happening and how to fix it? I've tried googling but haven't really found anything.

Martijn Pieters · Accepted Answer

You are printing UTF-8 bytes.

UTF-8 is a variable-width encoding; it uses between 1 and 4 bytes to encode a given Unicode codepoint. The Ö is encoded to two bytes in UTF-8, while the letter A requires only one:

>>> u'Ö'.encode('utf8')
'\xc3\x96'
>>> u'A'.encode('utf8')
'A'

Printing just the first byte (hexadecimal C3) is not valid UTF-8 output, so your terminal uses a ? to indicate it cannot decode the data you printed. On my Mac terminal, the U+FFFD REPLACEMENT CHARACTER � character is printed in that case:

>>> print u'Ö'.encode('utf8')
Ö
>>> print u'Ö'.encode('utf8')[0]
�
>>> print u'Ö'.encode('utf8')[1]
�
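
To make the failure concrete, here is a minimal sketch of what happens when the string ends on the first byte of Ö. It is written to run on both Python 2 and 3; the names `try_decode`, `complete`, `truncated` and `replaced` are my own, not part of your program.

```python
# u'\xd6' is Ö; the encoded string is 5 bytes long: A A A C3 96
data = u'AAA\xd6'.encode('utf8')

def try_decode(raw):
    """Return the decoded text, or None if raw is not complete, valid UTF-8."""
    try:
        return raw.decode('utf8')
    except UnicodeDecodeError:
        return None

complete = try_decode(data)       # all five bytes: decodes fine
truncated = try_decode(data[:4])  # ends on the lone lead byte C3: cannot decode

# A lenient decoder substitutes U+FFFD for the dangling byte, which is
# roughly what a terminal does when it shows ? or the replacement character.
replaced = data[:4].decode('utf8', 'replace')
```

Here `truncated` is `None` because the four-byte prefix stops halfway through Ö, which is exactly the state `mystring` is in on the iterations where you see the `?`.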

If you first decode your data to a unicode object, you can iterate over codepoints rather than over bytes:

for i in input:
    l = list(i.decode('utf8'))

Note that you don't have to call list() on the object just to iterate. Looping over a string already gives you individual characters.
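
As a sketch of that simplification, the inner loop can iterate over the decoded line directly. The byte string below is only a stand-in for one line read from your file, and the `prefixes` list is a hypothetical addition so the results can be inspected; the code runs on Python 2 and 3.

```python
# Stand-in for one line read from test.txt in binary form (10 bytes).
line = u'AAA\xd6\xd6AAA'.encode('utf8')

prefixes = []
mystring = u""
for ch in line.decode('utf8'):  # iterates over codepoints, not bytes
    mystring += ch
    prefixes.append(mystring)   # every prefix now ends on a whole character
```

The 10 bytes produce 8 prefixes, one per character, and none of them ends in the middle of an Ö.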

You could also open the file with io.open(); the resulting file object returns unicode objects when you read from it, provided you tell it which codec to use:

import io

with io.open("test.txt", encoding='utf8') as f:
    input = f.readlines()
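
Putting it together, a version of your program built on io.open() might look like the sketch below. It creates a throwaway copy of test.txt in a temporary directory so that it is self-contained, which your real script would not need. It also strips the trailing newline and collects the prefixes in a list (`lines_out`, a name I made up) instead of printing them; both are my choices rather than anything your original code does.

```python
import io
import os
import tempfile

# Create a throwaway test.txt so the sketch runs on its own.
path = os.path.join(tempfile.mkdtemp(), "test.txt")
with io.open(path, "w", encoding="utf8") as f:
    f.write(u"AAA\xd6\xd6AAA\n")  # u'\xd6' is Ö

lines_out = []
with io.open(path, encoding="utf8") as f:
    for line in f:                     # each line is already unicode text
        mystring = u""
        for ch in line.rstrip(u"\n"):  # drop the newline, then walk the characters
            mystring += ch
            lines_out.append(mystring)
```

Because the file object decodes as it reads, the loop never sees raw bytes, so no separate decode() call is needed.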
