Reputation: 393
I don't know if this is my misunderstanding of UTF-8 or of python, but I'm having trouble understanding how python writes Unicode characters to a file. I'm on a Mac under OSX by the way, if that makes a difference.
Let's say I have the following unicode string
foo=u'\x93Stuff in smartquotes\x94\n'
Here \x93 and \x94 are those awful smart-quotes.
Then I write it to a file:
with open('file.txt','w') as file:
    file.write(foo.encode('utf8'))
When I open the file in a text editor like TextWrangler or in a web browser, file.txt seems to have been written as
\xc2\x93Stuff in smartquotes\xc2\x94\n
The text editor properly understands the file to be UTF8 encoded, but it renders \xc2\x93 as garbage. If I go in and manually strip out the \xc2 part, I get what I expect, and TextWrangler and Firefox render the utf characters as smartquotes.
This is exactly what I get when I read the file back into python without decoding it as 'utf8'. However, when I do read it in with read().decode('utf8'), I get back what I originally put in, without the \xc2 bit.
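A minimal Python 3 sketch of the round trip described above (the snippets in this post are Python 2, so the `u''` prefix and the explicit `encode` before writing are dropped here). It only illustrates that the code point U+0093 legitimately encodes to the two bytes C2 93 in UTF-8, so the \xc2 is not corruption:

```python
# Python 3 sketch: '\x93' in a str literal is the code point U+0093.
foo = '\x93Stuff in smartquotes\x94\n'

# Code points U+0080..U+07FF take two bytes in UTF-8,
# so U+0093 becomes C2 93 and U+0094 becomes C2 94.
encoded = foo.encode('utf-8')
print(encoded)

# Decoding as UTF-8 reverses the step and restores the original string.
decoded = encoded.decode('utf-8')
print(decoded == foo)
```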
This is driving me bonkers, because I'm trying to parse a bunch of html files into text and the incorrect rendering of these unicode characters is screwing up a bunch of stuff.
I also tried it in python3 using the read/write methods normally, and it has the same behavior.
edit: Regarding stripping out the \xc2 manually, it turns out that it was rendering correctly when I did that because the browser and text editors were defaulting to Latin encoding.
Also, as a follow up, Firefox renders the text as
☐Stuff in smartquotes☐
where the boxes are empty unicode values, while Chrome renders the text as
Stuff in smartquotes
Upvotes: 2
Reputation: 178115
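The problem is, u'\x93' and u'\x94' are not the Unicode codepoints for smart quotes. They are the byte values for smart quotes in the Windows-1252 encoding, which is not the same as the latin1 encoding. In latin1, those values are not printable characters: they map to the C1 control codepoints U+0093 and U+0094, which have no Unicode name.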
>>> import unicodedata as ud
>>> ud.name(u'\x93')
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
ValueError: no such name
>>> ud.name(u'\x94')
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
ValueError: no such name
>>> ud.name(u'\u201c')
'LEFT DOUBLE QUOTATION MARK'
>>> ud.name(u'\u201d')
'RIGHT DOUBLE QUOTATION MARK'
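To make the difference between the two encodings concrete, here is a small Python 3 sketch that decodes the same byte both ways. The variable names are only for illustration:

```python
import unicodedata

raw = b'\x93'

# latin-1 maps every byte 1:1 onto the code point with the same number,
# so 0x93 becomes U+0093, a control character (category 'Cc').
as_latin1 = raw.decode('latin-1')
print(hex(ord(as_latin1)), unicodedata.category(as_latin1))

# Windows-1252 (cp1252) assigns 0x93 to the left smart quote, U+201C.
as_cp1252 = raw.decode('cp1252')
print(hex(ord(as_cp1252)), unicodedata.name(as_cp1252))
```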
So you should use one of:
foo = u'\u201cStuff in smartquotes\u201d'
foo = u'\N{LEFT DOUBLE QUOTATION MARK}Stuff in smartquotes\N{RIGHT DOUBLE QUOTATION MARK}'
or in a UTF-8 source file:
#coding:utf8
foo = u'“Stuff in smartquotes”'
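With the correct codepoints in the string, writing the file works as expected. A minimal Python 3 sketch, assuming a scratch directory from `tempfile` instead of the `file.txt` in the question, and letting `open` do the encoding:

```python
import os
import tempfile

foo = '\u201cStuff in smartquotes\u201d\n'

# Hypothetical scratch location, used only so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), 'file.txt')

# In Python 3, pass the encoding to open() and write the str directly.
# newline='' turns off newline translation so '\n' is written unchanged.
with open(path, 'w', encoding='utf-8', newline='') as f:
    f.write(foo)

# Read the raw bytes back to see what actually landed on disk.
with open(path, 'rb') as f:
    raw = f.read()
print(raw)
```

Each smart quote should now occupy three bytes (E2 80 9C and E2 80 9D), which is what a UTF-8 aware editor or browser expects.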
Edit: If you somehow have a Unicode string with those incorrect codepoints in it, here's a way to fix them. The first 256 Unicode codepoints map 1:1 with the latin1 encoding, so it can be used to encode a mis-decoded Unicode string directly back to a byte string so the correct decoding can be used:
>>> foo = u'\x93Stuff in smartquotes\x94'
>>> foo
u'\x93Stuff in smartquotes\x94'
>>> foo = foo.encode('latin1').decode('windows-1252')
>>> foo
u'\u201cStuff in smartquotes\u201d'
>>> print foo
“Stuff in smartquotes”
If you have the UTF-8-encoded version of the incorrect Unicode characters:
>>> foo = '\xc2\x93Stuff in smartquotes\xc2\x94'
>>> foo = foo.decode('utf8').encode('latin1').decode('windows-1252')
>>> foo
u'\u201cStuff in smartquotes\u201d'
>>> print foo
“Stuff in smartquotes”
And in the very worst case, if you have the following Unicode string:
>>> foo = u'\xc2\x93Stuff in smartquotes\xc2\x94'
>>> foo.encode('latin1') # back to a UTF-8 encoded byte string.
'\xc2\x93Stuff in smartquotes\xc2\x94'
>>> foo.encode('latin1').decode('utf8') # Undo the UTF-8, but Unicode is still wrong.
u'\x93Stuff in smartquotes\x94'
>>> foo.encode('latin1').decode('utf8').encode('latin1') # back to a byte string.
'\x93Stuff in smartquotes\x94'
>>> foo.encode('latin1').decode('utf8').encode('latin1').decode('windows-1252') # Now decode correctly.
u'\u201cStuff in smartquotes\u201d'
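The repairs above can be rolled into one helper. This is only a sketch in Python 3: `fix_smartquotes` is a made-up name, and it assumes the input contains nothing above U+00FF and that the underlying bytes really were Windows-1252. Python's cp1252 codec rejects the few bytes that encoding leaves unassigned (such as 0x81), so those would raise an error.

```python
def fix_smartquotes(text):
    """Repair text whose Windows-1252 bytes were mis-decoded as latin-1.

    Also handles the worst case where the wrong code points were
    UTF-8 encoded and then mis-decoded as latin-1 a second time.
    """
    # latin-1 maps code points 0-255 straight back to the same byte values.
    raw = text.encode('latin-1')
    try:
        # Worst case: the bytes are UTF-8 for the wrong code points; undo that layer.
        raw = raw.decode('utf-8').encode('latin-1')
    except (UnicodeDecodeError, UnicodeEncodeError):
        # Not double-encoded, so keep the bytes from the first step.
        pass
    # Finally decode the bytes with the encoding they were really in.
    return raw.decode('cp1252')


once = fix_smartquotes('\x93Stuff in smartquotes\x94')
twice = fix_smartquotes('\xc2\x93Stuff in smartquotes\xc2\x94')
print(once)
print(twice)
```

If the data is still a byte string read from disk, the middle case corresponds to `data.decode('utf-8')` followed by the helper. For the HTML files in the question, decoding them as 'cp1252' when first reading them would likely avoid the problem altogether.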
Upvotes: 6