Unicode characters output from python I/O to files

Question

I don't know if this is my misunderstanding of UTF-8 or of python, but I'm having trouble understanding how python writes Unicode characters to a file. I'm on a Mac under OSX by the way, if that makes a difference.

Let's say I have the following unicode string

foo=u'\x93Stuff in smartquotes\x94 '

Here \x93 and \x94 are those awful smart-quotes.

Then I write it to a file:

with open('file.txt', 'w') as file:
    file.write(foo.encode('utf8'))

When I open the file in a text editor like TextWrangler or in a web browser file.txt seems like it was written as

\xc2\x93Stuff in smartquotes\xc2\x94

The text editor properly understands the file to be UTF8 encoded, but it renders \xc2\x93 as garbage. If I go in and manually strip out the \xc2 part, I get what I expect, and TextWrangler and Firefox render the utf characters as smartquotes.

This is exactly what I get when I read the file back into python without decoding it as 'utf8'. However, when I do read it in with the read().decode('utf8') method, I get back what I originally put in, without the \xc2 bit.

This is driving me bonkers, because I'm trying to parse a bunch of html files into text and the incorrect rendering of these unicode characters is screwing up a bunch of stuff.

I also tried it in python3 using the read/write methods normally, and it has the same behavior.

edit: Regarding stripping out the \xc2 manually, it turns out that it was rendering correctly when I did that because the browser and text editors were defaulting to Latin encoding.

Also, as a follow-up, Firefox renders the text as

☐Stuff in smartquotes☐

where the boxes are empty unicode values, while Chrome renders the text as

Stuff in smartquotes

Mark Tolonen · Accepted Answer

The problem is, u'\x93' and u'\x94' are not the Unicode codepoints for smart quotes. They are smart quotes in the Windows-1252 encoding, which is not the same as the latin1 encoding. In latin1, those values are not defined.

>>> import unicodedata as ud
>>> ud.name(u'\x93')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: no such name
>>> ud.name(u'\x94')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: no such name
>>> ud.name(u'\u201c')
'LEFT DOUBLE QUOTATION MARK'
>>> ud.name(u'\u201d')
'RIGHT DOUBLE QUOTATION MARK'
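
As a side note on where the \xc2 in the question comes from: UTF-8 encodes every code point from U+0080 to U+07FF as two bytes, so U+0093 becomes \xc2\x93. A minimal Python 3 sketch (the variable names are only illustrative) comparing that control character with the real left quote:

```python
import unicodedata

c1_control = '\x93'      # U+0093, a C1 control character, not a quote
left_quote = '\u201c'    # U+201C LEFT DOUBLE QUOTATION MARK

# UTF-8 encodes U+0080..U+07FF as two bytes, hence the leading 0xC2.
c1_bytes = c1_control.encode('utf-8')
quote_bytes = left_quote.encode('utf-8')

print(c1_bytes)      # b'\xc2\x93'
print(quote_bytes)   # b'\xe2\x80\x9c'

# The single byte 0x93 only means "left smart quote" under Windows-1252.
print(b'\x93'.decode('windows-1252') == left_quote)   # True
print(unicodedata.category(c1_control))               # Cc (control character)
```

So the file was written correctly as UTF-8; it just contains an encoded control character rather than an encoded quotation mark.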

So you should use one of:

foo = u'\u201cStuff in smartquotes\u201d'
foo = u'\N{LEFT DOUBLE QUOTATION MARK}Stuff in smartquotes\N{RIGHT DOUBLE QUOTATION MARK}'

or in a UTF-8 source file:

#coding:utf8
foo = u'“Stuff in smartquotes”'
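
Since the question also mentions Python 3, here is a rough sketch of writing the corrected string there. It assumes Python 3, where open() accepts an encoding argument and the string is written without a manual .encode(); the temporary path is only a stand-in for the question's file.txt.

```python
import os
import tempfile

foo = '\u201cStuff in smartquotes\u201d'

# Hypothetical path in a temp directory, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), 'file.txt')

# In Python 3, pass the encoding to open() and write the str directly.
with open(path, 'w', encoding='utf-8') as f:
    f.write(foo)

# Read the raw bytes back to see what actually landed on disk.
with open(path, 'rb') as f:
    raw = f.read()

print(raw[:3])                       # b'\xe2\x80\x9c'
print(raw.decode('utf-8') == foo)    # True
```

The file now starts with the three-byte sequence for U+201C, which a UTF-8-aware editor or browser should render as a real smart quote.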

Edit: If you somehow have a Unicode string with those incorrect codepoints in it, here's a way to fix them. The first 256 Unicode codepoints map 1:1 to the latin1 encoding, so latin1 can be used to encode a mis-decoded Unicode string directly back to a byte string, which can then be decoded with the correct encoding:

>>> foo = u'\x93Stuff in smartquotes\x94'
>>> foo
u'\x93Stuff in smartquotes\x94'
>>> foo = foo.encode('latin1').decode('windows-1252')
>>> foo
u'\u201cStuff in smartquotes\u201d'
>>> print foo
“Stuff in smartquotes”
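
One caveat with the whole-string round trip: encode('latin1') raises if the string also contains characters above U+00FF, and Python's windows-1252 codec leaves five bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) undefined. A possible workaround, sketched here for Python 3 with a hypothetical helper name, is to reinterpret only the characters in the C1 range and leave everything else alone:

```python
def repair_c1_as_cp1252(text):
    """Reinterpret C1 control characters (U+0080..U+009F) as Windows-1252.

    Hypothetical helper for illustration; characters outside that range,
    and C1 values that Windows-1252 leaves undefined, are kept unchanged.
    """
    out = []
    for ch in text:
        if '\x80' <= ch <= '\x9f':
            try:
                ch = ch.encode('latin-1').decode('windows-1252')
            except UnicodeDecodeError:
                pass  # no Windows-1252 meaning for this byte; keep as-is
        out.append(ch)
    return ''.join(out)


fixed = repair_c1_as_cp1252('\x93Stuff in smartquotes\x94')
print(fixed == '\u201cStuff in smartquotes\u201d')   # True
```

This assumes the stray C1 characters really did come from Windows-1252 text; if they are genuine control characters, the helper would change them incorrectly.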

If you have the UTF-8-encoded version of the incorrect Unicode characters:

>>> foo = '\xc2\x93Stuff in smartquotes\xc2\x94'
>>> foo = foo.decode('utf8').encode('latin1').decode('windows-1252')
>>> foo
u'\u201cStuff in smartquotes\u201d'
>>> print foo
“Stuff in smartquotes”

And in the very worst case, if you have the following Unicode string:

>>> foo = u'\xc2\x93Stuff in smartquotes\xc2\x94'
>>> foo.encode('latin1') # back to a UTF-8 encoded byte string.
'\xc2\x93Stuff in smartquotes\xc2\x94'
>>> foo.encode('latin1').decode('utf8') # Undo the UTF-8, but Unicode is still wrong.
u'\x93Stuff in smartquotes\x94'
>>> foo.encode('latin1').decode('utf8').encode('latin1') # back to a byte string.
'\x93Stuff in smartquotes\x94'
>>> foo.encode('latin1').decode('utf8').encode('latin1').decode('windows-1252') # Now decode correctly.
u'\u201cStuff in smartquotes\u201d'
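
The sessions above are Python 2. For anyone on Python 3, the same three repairs might look roughly like this, assuming the byte-string cases are written as bytes literals (the case and variable names are only for illustration):

```python
expected = '\u201cStuff in smartquotes\u201d'

# Case 1: a str holding C1 control characters instead of quotes.
case1 = '\x93Stuff in smartquotes\x94'
fixed1 = case1.encode('latin-1').decode('windows-1252')

# Case 2: the UTF-8 bytes of that wrong str.
case2 = b'\xc2\x93Stuff in smartquotes\xc2\x94'
fixed2 = case2.decode('utf-8').encode('latin-1').decode('windows-1252')

# Case 3: those UTF-8 bytes mis-decoded as latin-1 into a str.
case3 = '\xc2\x93Stuff in smartquotes\xc2\x94'
fixed3 = (case3.encode('latin-1').decode('utf-8')
               .encode('latin-1').decode('windows-1252'))

print(fixed1 == fixed2 == fixed3 == expected)   # True
```

Each chain undoes one wrong decoding step at a time until the original Windows-1252 bytes are recovered, then decodes them with the right codec.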
