Ivan Rocha

Reputation: 87

How to encode HTML non-ASCII data to UTF-8 in Python

I tried to do that and got these errors:

>>> import re  
>>> x = 'Ingl\xeas'  
>>> x  
'Ingl\xeas'  
>>> print x  
Ingl�s  
>>> x.decode('utf8')  
Traceback (most recent call last):  
    File "<stdin>", line 1, in <module>  
    File "/usr/lib/python2.6/encodings/utf_8.py", line 16, in decode  
        return codecs.utf_8_decode(input, errors, True)  
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 4-5: unexpected end of data  
>>> x.decode('utf8', 'ignore')  
u'Ingl'  
>>> x.decode('utf8', 'replace')  
u'Ingl\ufffd'  
>>> print x.decode('utf8', 'replace')  
Ingl�  
>>> print x.decode('utf8', 'xmlcharrefreplace')  
Traceback (most recent call last):  
    File "<stdin>", line 1, in <module>  
    File "/usr/lib/python2.6/encodings/utf_8.py", line 16, in decode  
        return codecs.utf_8_decode(input, errors, True)  
TypeError: don't know how to handle UnicodeDecodeError in error callback  

When I use the print statement, this is what I want:

>>> print x  
u'Inglês'  

Any help is welcome.

Upvotes: 2

Views: 10650

Answers (3)

Tim Pietzcker

Reputation: 336148

Ingl\xeas

is not UTF-8 but (probably) Windows-1252- or latin1-encoded. So you first need to decode it from that encoding. Only then can you encode it to UTF-8.

Therefore:

>>> x = 'Ingl\xeas'
>>> print x.decode("cp1252")
Inglês

Similarly,

>>> x.decode("cp1252").encode("UTF-8")
'Ingl\xc3\xaas'

which is the correct UTF-8 representation.

By the way, in Python 3, you can (at least in the interactive console under Windows) simply type

>>> x = 'Ingl\xeas'
>>> print (x)
Inglês

since Python 3 strings are always Unicode strings (not counting bytes objects).
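For completeness, here is a minimal sketch of the same decode-then-encode round trip in Python 3, assuming the data arrives as a bytes object (for example from a file opened in binary mode or from a socket) and assuming it really is cp1252. The variable names are only illustrative.

```python
# Assumption: the raw input is bytes, not str, and is cp1252-encoded.
raw = b'Ingl\xeas'

# Step 1: decode with the encoding the bytes were actually written in.
text = raw.decode('cp1252')

# Step 2: encode the resulting str as UTF-8.
utf8_bytes = text.encode('utf-8')

print(len(text))     # 6 characters
print(utf8_bytes)    # b'Ingl\xc3\xaas'
```

The byte 0xEA becomes the single character U+00EA after decoding, and that character takes two bytes (0xC3 0xAA) in UTF-8.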

Upvotes: 0

John Machin

Reputation: 82934

Some observations:

(1) latin1 will decode ANY 8-bit byte without throwing an exception, so a successful decode proves nothing. Use latin1 only when you have exhausted all other possibilities. Use chardet to help decide what encoding a particular file, webpage, or XML stream is in.

(2) Possible alternatives based on very limited evidence (ONE character):

>>> import unicodedata as ucd
>>> for codepage in range(1250, 1259):
...    try:
...        uc = "\xea".decode(str(codepage))
...    except UnicodeDecodeError:
...        continue  # byte undefined in this codepage; don't test a stale uc
...    if uc == u'\xea': print codepage, ucd.name(uc)
...
1252 LATIN SMALL LETTER E WITH CIRCUMFLEX
1254 LATIN SMALL LETTER E WITH CIRCUMFLEX
1256 LATIN SMALL LETTER E WITH CIRCUMFLEX
1258 LATIN SMALL LETTER E WITH CIRCUMFLEX
>>>

(3) The range U+0080 to U+009F (inclusive) is assigned to the "C1 control characters", which have essentially no practical use in ordinary text. No matter what encoding you are using (even UTF-8), decoding to unicode without an exception does not mean you are out of the woods. Check for characters in that range. If you find any, your data is corrupt, or your choice of encoding is not correct.

def check_for_c1_control_characters(unicode_obj):
    # The u prefix matters in Python 2: without it, '\u0080' is a literal backslash sequence.
    return any(u'\u0080' <= c <= u'\u009F' for c in unicode_obj)

or use a regex, as in this example of how to fix one of the many ways the data can be corrupted.
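As a rough illustration of points (1) and (3) together, here is a regex-based sketch of the check in Python 3 syntax (an assumption, since the snippets above are Python 2). The helper name `has_c1_controls` and the sample bytes are hypothetical. It shows latin1 silently "succeeding" on a byte that cp1252 would map to a real character.

```python
import re

# Matches any code point in the C1 control range U+0080..U+009F.
C1_CONTROLS = re.compile('[\u0080-\u009f]')

def has_c1_controls(text):
    """Return True if the decoded text contains a C1 control character."""
    return C1_CONTROLS.search(text) is not None

# Hypothetical sample: byte 0x80 is the euro sign in cp1252.
raw = b'\x80 5'

as_latin1 = raw.decode('latin1')   # never raises; 0x80 becomes U+0080
as_cp1252 = raw.decode('cp1252')   # 0x80 becomes U+20AC (euro sign)

print(has_c1_controls(as_latin1))  # True: the encoding guess is suspect
print(has_c1_controls(as_cp1252))  # False
```

A hit from this check does not tell you which encoding is right, only that the one you picked probably is not.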

Upvotes: 0

Daniel Stutzbach

Reputation: 76667

You need to know how the input data is encoded before you decode it. In some of your attempts, you're trying to decode it from UTF-8, but Python throws an exception because the input isn't valid UTF-8. It looks like it might be latin-1. This works for me:

>>> x = 'Ingl\xeas'
>>> print x.decode('latin1')
Inglês

You mention "non-ASCII HTML". If you're writing a web server script and you're getting data from an HTTP request, you should check the Content-Type header. In an ideal world, it will tell you which encoding the client is using for the data. Keep in mind that the client may be working incorrectly.
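Here is a rough sketch of that idea in Python 3 syntax (an assumption, since the rest of this thread is Python 2). It reads the charset parameter from a Content-Type header value with the standard library's `email.message.Message`, and falls back to a guessed encoding if the declared one is missing or wrong. The function names and the cp1252 fallback are my own illustrative choices, not a standard recipe.

```python
from email.message import Message

def charset_from_content_type(header_value, default=None):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg['Content-Type'] = header_value
    return msg.get_content_charset(default)

def decode_body(body, content_type, fallback='cp1252'):
    """Decode bytes using the declared charset, else a guessed fallback."""
    charset = charset_from_content_type(content_type)
    if charset is not None:
        try:
            return body.decode(charset)
        except (UnicodeDecodeError, LookupError):
            pass  # declared charset is wrong for this data, or unknown
    # Assumption: cp1252 is a reasonable guess; undecodable bytes become U+FFFD.
    return body.decode(fallback, errors='replace')

print(charset_from_content_type('text/html; charset=ISO-8859-1'))  # iso-8859-1
print(charset_from_content_type('text/html'))                      # None
```

With this, `decode_body(b'Ingl\xeas', 'text/html; charset=utf-8')` would still give the expected text, because the UTF-8 decode fails and the fallback takes over. That covers the "client may be working incorrectly" case, at the cost of trusting a guess.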

Hope that helps!

Upvotes: 7
