Ankur Agarwal

Reputation: 24748

Encoding with Unicode and non-Unicode characters in HTML

I am using this package here: HTML.py 0.04

Here is what I am doing:

import html
h = html.HTML()
h.p('Some simple Euro: €1.14')
h.p(u'Some Euro: €1.14')

Now when I call unicode(h), I get an error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 18: ordinal not in range(128)

What is the best way to handle this? I need to write the html to a file.

Upvotes: 0

Views: 434

Answers (1)

bobince

Reputation: 536329

h.p('Some simple Euro: €1.14')

You should avoid byte strings ('' in Python 2, b'' in Python 3) for HTML content. The character model of HTML is Unicode, so only Unicode strings (u'') should be used.
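Since the goal is to write the HTML to a file, here is a minimal sketch of keeping the text as Unicode and encoding it only at the file boundary. It does not use HTML.py; `markup` is a hypothetical stand-in for whatever Unicode text your builder produces, and `write_html`, the file name, and the choice of UTF-8 are assumptions for illustration.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# Hypothetical markup standing in for the builder's Unicode output.
# \u20ac is the euro sign, written as an escape so the example does not
# depend on the encoding this source file is saved in.
markup = u'<p>Some Euro: \u20ac1.14</p>'


def write_html(path, text, encoding='utf-8'):
    """Write Unicode text to path, encoding only when it hits the file."""
    with io.open(path, 'w', encoding=encoding) as f:
        f.write(text)


# Assumed output location: a temporary directory, to keep the sketch harmless.
out_path = os.path.join(tempfile.mkdtemp(), 'euro.html')
write_html(out_path, markup)

# Read the file back as raw bytes to see what was actually stored.
with io.open(out_path, 'rb') as f:
    raw = f.read()
```

`io.open` with an explicit `encoding` behaves the same way on Python 2 and Python 3, so the text stays Unicode everywhere inside the program and only becomes bytes on disk.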

You can get away with doing it wrong for simple ASCII characters. Because most common byte encodings are supersets of ASCII, Python 2 will implicitly convert ASCII byte strings to Unicode. But the € character isn't part of ASCII, so Python can't tell how to read it. If you have saved the source code above using the UTF-8 encoding then you have the byte string b'\xe2\x82\xac', which could mean €, â‚¬, 竄ャ, or many other character sequences depending on what encoding is used.
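To make the ambiguity concrete, here is a small sketch that decodes the same three bytes under different assumptions. The variable names are just for illustration; the codecs used (`utf-8`, `cp1252`, `ascii`) are standard Python codecs.

```python
raw = b'\xe2\x82\xac'  # the bytes UTF-8 uses for the euro sign

# Right guess: one character, the euro sign.
as_utf8 = raw.decode('utf-8')

# Wrong guess: Windows-1252 turns the same bytes into three characters.
as_cp1252 = raw.decode('cp1252')

# ASCII has no meaning for byte 0xe2 at all, so decoding fails outright.
# This is the kind of failure the implicit conversion in Python 2 runs into.
try:
    raw.decode('ascii')
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = str(exc)

print(ascii_error)
```

The bytes never change; only the encoding you name when decoding decides which characters you get, which is why the encoding has to be stated explicitly rather than left for Python to guess.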

Upvotes: 1
