Reputation: 11
I have a Python 2.7 code which retrieves a base64 encoded response from a server. This response is decoded using base64
module (b64decode
/ decodestring
functions, returning str
). Its decoded content has the Unicode code points of the original strings.
I need to convert these Unicode code points to UTF-8.
The original string has a substring content "Não". When I decode the responded string, it shows:
>>> encoded_str = ... # server response
>>> decoded_str = base64.b64decode(encoded_str)
>>> type(decoded_str)
<type 'str'>
>>> decoded_str[x:y]
'N\xe3o'
When I try to encode to UTF-8, it leads to errors as
>>> (decode_str[x:y]).encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 2: ordinal not in range(128)
However, when this string is manually written in Unicode type, I can correctly convert it to my desired UTF-8 string.
>>> test_str = u'N\xe3o'
>>> test.encode('utf-8')
'N\xc3\xa3o'
I have to retrieve this response from the server and correctly generate an UTF-8 string which can be printed as "Não", how can I do this in Python 2?
Upvotes: 1
Views: 533
Reputation: 189417
You want to decode
, not encode
the byte string.
Think of it like this: a Unicode string was encoded into bytes, and these bytes were further encoded into base64.
To reverse this, you need to reverse both encodings, in the opposite order.
However, the sample you show most definitely isn't a valid UTF-8 byte string - 0xE3 in isolation is not a valid UTF-8 encoding. Most likely, the Unicode string was encoded using Latin-1 or a related encoding (the sample is much too small to establish this conclusively; other common candidates are the fugly Windows code page CP1252 and Latin-9).
Upvotes: 2