Reputation: 87
I tried to do that, and I got these errors:
>>> import re
>>> x = 'Ingl\xeas'
>>> x
'Ingl\xeas'
>>> print x
Ingl�s
>>> x.decode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 4-5: unexpected end of data
>>> x.decode('utf8', 'ignore')
u'Ingl'
>>> x.decode('utf8', 'replace')
u'Ingl\ufffd'
>>> print x.decode('utf8', 'replace')
Ingl�
>>> print x.decode('utf8', 'xmlcharrefreplace')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
TypeError: don't know how to handle UnicodeDecodeError in error callback
When I use the print statement, I want this:
>>> print x
u'Inglês'
Any help is welcome.
Upvotes: 2
Views: 10650
Reputation: 336148
Ingl\xeas
is not UTF-8 but (probably) Windows-1252- or latin1-encoded. So you first need to decode it; only then can you encode it to UTF-8.
Therefore:
>>> x = 'Ingl\xeas'
>>> print x.decode("cp1252")
Inglês
Similarly,
>>> x.decode("cp1252").encode("UTF-8")
'Ingl\xc3\xaas'
which is the correct UTF-8 representation.
By the way, in Python 3, you can (at least in the interactive console under Windows) simply type
>>> x = 'Ingl\xeas'
>>> print (x)
Inglês
since Python 3 strings are always Unicode strings (not counting bytes
objects).
Upvotes: 0
Reputation: 82934
Some observations:
(1) latin1 will decode ANY 8-bit byte without throwing an exception. Use latin1 only when you have exhausted all other possibilities. Use chardet to help decide what a particular file, webpage, or XML stream is encoded in.
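For example, a rough sketch of how chardet can be used (chardet is a third-party package, and the file name below is just a placeholder):
import chardet
raw = open('page.html', 'rb').read()
guess = chardet.detect(raw)  # a dict with 'encoding' and 'confidence' keys
print guess['encoding'], guess['confidence']
if guess['encoding'] is not None:
    text = raw.decode(guess['encoding'])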
(2) Possible alternatives based on very limited evidence (ONE character):
>>> import unicodedata as ucd
>>> for codepage in range(1250, 1259):
...     try:
...         uc = "\xea".decode(str(codepage))
...     except UnicodeDecodeError:
...         continue  # skip codepages that can't decode 0xEA at all
...     if uc == u'\xea': print codepage, ucd.name(uc)
...
1252 LATIN SMALL LETTER E WITH CIRCUMFLEX
1254 LATIN SMALL LETTER E WITH CIRCUMFLEX
1256 LATIN SMALL LETTER E WITH CIRCUMFLEX
1258 LATIN SMALL LETTER E WITH CIRCUMFLEX
>>>
(3) The range U+0080 to U+009F (inclusive) is assigned to the "C1 control characters", which nobody outside unicode.org seems to have any use for. No matter what encoding you are using (even UTF-8), decoding to unicode without an exception does not mean you are out of the woods yet. Check for characters in that range; if you find any, your data is corrupt or your choice of encoding is not correct.
def check_for_c1_control_characters(unicode_obj):
    # note the u'' prefixes: in Python 2, \u escapes only work in unicode literals
    return any(u'\u0080' <= c <= u'\u009F' for c in unicode_obj)
or use a regex, as in this example of how to fix one of the many ways the data can be corrupted.
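A rough sketch of the regex variant (the function name here is mine, not taken from the linked example):
import re
C1_RE = re.compile(u'[\u0080-\u009f]')
def has_c1_control_characters(unicode_obj):
    # True if any C1 control character appears in the unicode string
    return C1_RE.search(unicode_obj) is not None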
Upvotes: 0
Reputation: 76667
You need to know how the input data is encoded before you decode it. In some of your attempts, you're trying to decode it from UTF-8, but Python throws an exception because the input isn't valid UTF-8. It looks like it might be latin-1. This works for me:
>>> x = 'Ingl\xeas'
>>> print x.decode('latin1')
Inglês
You mention "non-ASCII HTML". If you're writing a web server script and you're getting data from an HTTP request, you should check the Content-Type header. In an ideal world, it will tell you which encoding the client is using for the data. Keep in mind that the client may be working incorrectly.
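For instance, a minimal sketch of pulling the charset out of the Content-Type header in a CGI-style script (the latin-1 fallback is just an assumption, not a rule):
import cgi, os, sys
mimetype, params = cgi.parse_header(os.environ.get('CONTENT_TYPE', ''))
charset = params.get('charset', 'latin-1')  # fall back if the client didn't say
body = sys.stdin.read()
text = body.decode(charset, 'replace')  # be defensive: the header may be wrong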
Hope that helps!
Upvotes: 7