Encode Bytearray into UTF-8

Question

So, in Python 2.7 I have a string:

Python 2.7.8 (default, Apr 15 2015, 09:26:43) 
[GCC 4.9.2 20150212 (Red Hat 4.9.2-6)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import scrypt
>>> s=scrypt.encrypt('somestring', 'test'.encode('ascii'), 0.1)
>>> s
'scrypt\x00\n\x00\x00\x00\x08\x00\x00\x00\x016 \xf2\xcc\xf9\xd2\xbe\xd4\xdbU!\xaf\xecKk{\x8b\n\x94\xe8\x11\xf2\x00\x1f\xd9\xceBhf$cM\x12{\xd8\x84\\xf2j`\xba\xc5Xk\x196)\xf5\xd3\xe9\x15\xdd\xd3\xa0A_K\x00\x18\x03J\x85\xee\n\xcc\xea\x86\xda\xaa\xfd6E\xf4\x804\xfe\x04\xca\xec!\x94F\x84)B\tf\x07\xd9!@B,\x9e\xffc\xf2\xb6e\x8c\xa9HA\x98\x99\xa0\xe8\xcf\x85P2\x13\x0f\xa1\xf6\x90nO\x85Z\xb2\xc1'
>>> type(s)
<type 'str'>

(It's real ugly.)

I need to encode it into text - either a unicode object or a utf-8 string.

TypeError: You are required to pass either a unicode object or a utf-8 string here.
You passed a Python string object which contained non-utf-8:
'scrypt\x00\n\x00\x00\x00\x08\x00\x00\x00\x01\xce\xf5\xba\x19\xeb1z/5*`m\xec\xf6sgT4\xb5.\xf7^\x96\xfaMY6\xa0\xdb\t\xa3*<5A<\xfb\xbe\xfb>w\xa3,MjaX;\xc1r\xdc\xbd\x04W\xafq3O\x90\x19!\x13\xe8\x0c\x86\xf5\xc96\xf4K\x16\xe3^.v\x8a\xe0\xda\xdd>#\xa7\\x1c\xc2\x11\x85\x01\xb5\xd4\x92\xef\xa1k\x05Z\xaey\xd7M`%5.\x9f\xb1\xc4\x11N\xdeY\xa2\xac=\n\xb4aM\xfd)\xcc$\xbbq\xaa\xfd\x9d \xa5\xd39|\x85\xc8\x95\xbc\xfa\x17\xa1\x8e\xb8\x81 \xb4\x9b>j'.
The UnicodeDecodeError that resulted from attempting to interpret it as utf-8 was:
'utf8' codec can't decode byte 0xce in position 20: invalid continuation byte

The problem is, it's outside of the range of UTF-8:

>>> s.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf2 in position 18: ordinal not in range(128)

So: how should I go about encoding this string?

Bonus points if you can tell me why the ascii codec is the one having an error there (and a UnicodeDecodeError of all things) when I'm trying to encode a string.

(For the record, trying to encode as UTF-16 throws the exact same error.)

I've gotten it to work with base64 (which is, I suppose, what that's for) but I'm curious as to why I'm getting this error and what my options are.

Martijn Pieters · Accepted Answer

You have binary data. Not text, and certainly not Unicode. You cannot encode this to UTF-8 as it is not a unicode (text) object.

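Your UnicodeDecodeError is caused by Python trying to decode the data first; it is trying to be helpful, because normally you can only encode from Unicode to bytes. Since you called encode() on bytes instead, Python 2 first has to decode those bytes to Unicode, and it does that implicitly with the default ASCII codec. That is why the error names the ascii codec, and why asking for UTF-16 fails identically: the failure happens in the implicit decode, before the target codec is ever used. But you don't have ASCII data, nor data in any other text encoding.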
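The implicit two-step behaviour described above can be reproduced explicitly. The following is only a sketch: `data` is a short stand-in for the start of the scrypt output (not the asker's full value), and `encode_like_python2` is a hypothetical helper that spells out what Python 2's `str.encode()` does behind the scenes. It is written to run on both Python 2.7 and Python 3.

```python
# Short stand-in for the scrypt output: binary data whose first
# non-ASCII byte (0xf2) sits at index 18.
data = b'scrypt\x00\n\x00\x00\x00\x08\x00\x00\x00\x016 \xf2\xcc'


def encode_like_python2(raw, encoding):
    """Hypothetical helper: implicit ASCII decode, then the requested encode."""
    return raw.decode('ascii').encode(encoding)


errors = {}
for target in ('utf-8', 'utf-16'):
    try:
        encode_like_python2(data, target)
    except UnicodeDecodeError as exc:
        # The decode step fails before the target codec is ever reached,
        # so both targets report the same 'ascii' codec error.
        errors[target] = exc
        print('%s -> %s' % (target, exc))
```

Both iterations should fail in the ASCII decode step, which is consistent with the asker seeing the same error for UTF-8 and UTF-16.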

You cannot make Unicode out of those bytes because they are not text. Your only option is to use a binary-to-text scheme like base64, which wraps binary data in a manner safe for transport through systems expecting text (and thus not supporting \x00 NUL bytes, \x0a newlines, or other bytes that have special meaning in text streams).

See the binascii library for various binary-to-text schemes available in the Python standard library; base64 is the most widely used of these.
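As a rough illustration of the binary-to-text approach, the sketch below round-trips a small binary value through base64 and through hex. The `data` value is a made-up stand-in for the scrypt output; only the standard-library `base64` and `binascii` modules are used, and the code should behave the same on Python 2.7 and Python 3.

```python
import base64
import binascii

# Stand-in binary payload containing NUL, newline and non-ASCII bytes.
data = b'scrypt\x00\n\x00\x00\x00\x08\x00\x00\x00\x016 \xf2\xcc'

# base64 output is ASCII-only bytes, so decoding it as ASCII is always safe
# and yields a genuine text (unicode) object.
text = base64.b64encode(data).decode('ascii')
restored = base64.b64decode(text)

# Hex is another binary-to-text scheme: simpler to read, but twice the size.
hex_text = binascii.hexlify(data).decode('ascii')
restored_from_hex = binascii.unhexlify(hex_text)

print(text)
print(hex_text)
```

The text form can be handed to any API that insists on unicode or UTF-8 input; the receiving side has to reverse the encoding to recover the original bytes.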
