dakota

Reputation: 1095

Encode Bytearray into UTF-8

So, in Python 2.7 I have a string:

Python 2.7.8 (default, Apr 15 2015, 09:26:43) 
[GCC 4.9.2 20150212 (Red Hat 4.9.2-6)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import scrypt
>>> s=scrypt.encrypt('somestring', 'test'.encode('ascii'), 0.1)
>>> s
'scrypt\x00\r\x00\x00\x00\x08\x00\x00\x00\x016 \xf2\xcc\xf9\xd2\xbe\xd4\xdbU!\xaf\xecKk{\x8b\r\x94\xe8\x11\xf2\x00\x1f\xd9\xceBhf$cM\x12{\xd8\x84\\\xf2j`\xba\xc5Xk\x196)\xf5\xd3\xe9\x15\xdd\xd3\xa0A_K\x00\x18\x03J\x85\xee\n\xcc\xea\x86\xda\xaa\xfd6E\xf4\x804\xfe\x04\xca\xec!\x94F\x84)B\tf\x07\xd9!@B,\x9e\xffc\xf2\xb6e\x8c\xa9HA\x98\x99\xa0\xe8\xcf\x85P2\x13\x0f\xa1\xf6\x90nO\x85Z\xb2\xc1'
>>> type(s)
<type 'str'>

(It's real ugly.)

I need to convert it into text, either a unicode object or a UTF-8 string; otherwise I get:

TypeError: You are required to pass either a unicode object or a utf-8 string here.
You passed a Python string object which contained non-utf-8:
'scrypt\x00\r\x00\x00\x00\x08\x00\x00\x00\x01\xce\xf5\xba\x19\xeb1z/5*`m\xec\xf6sgT4\xb5.\xf7^\x96\xfaMY6\xa0\xdb\t\xa3*<5A<\xfb\xbe\xfb>w\xa3,MjaX;\xc1r\xdc\xbd\x04W\xafq3O\x90\x19!\x13\xe8\x0c\x86\xf5\xc96\xf4K\x16\xe3^.v\x8a\xe0\xda\xdd>#\xa7\\\x1c\xc2\x11\x85\x01\xb5\xd4\x92\xef\xa1k\x05Z\xaey\xd7M`%5.\x9f\xb1\xc4\x11N\xdeY\xa2\xac=\r\n\xb4aM\xfd)\xcc$\xbbq\xaa\xfd\x9d \xa5\xd39|\x85\xc8\x95\xbc\xfa\x17\xa1\x8e\xb8\x81 \xb4\x9b>j'.
The UnicodeDecodeError that resulted from attempting to interpret it as utf-8 was:
'utf8' codec can't decode byte 0xce in position 20: invalid continuation byte

The problem is, it's outside of the range of UTF-8:

>>> s.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf2 in position 18: ordinal not in range(128)

So: how should I go about encoding this string?

Bonus points if you can tell me why the ascii codec is the one having an error there (and a UnicodeDecodeError of all things) when I'm trying to encode a string.

(For the record, trying to encode as UTF-16 throws the exact same error.)

I've gotten it to work with base64 (which is, I suppose, what that's for) but I'm curious as to why I'm getting this error and what my options are.

Upvotes: 2

Views: 7612

Answers (2)

Martijn Pieters

Reputation: 1123042

You have binary data. Not text, and certainly not Unicode. You cannot encode this to UTF-8 as it is not a unicode (text) object.

Your UnicodeDecodeError is caused by Python trying to decode the data first; it is trying to be helpful because normally you can only encode from Unicode to bytes. Since you tried to do this on bytes instead, it first needs to decode the bytes to Unicode, and it'll do that using the ASCII codec. But you don't have ASCII data, nor any other text encoding.
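To make that implicit step visible, here is a minimal sketch that performs the decode explicitly. The `data` value is a shortened, hypothetical stand-in for the scrypt output (a few bytes borrowed from the question), not the real ciphertext. On Python 2, `data.encode('utf-8')` would go through the same ASCII decode first and fail in the same way.

```python
# Hypothetical stand-in for the scrypt output: a short run of bytes
# borrowed from the question, including the non-ASCII byte 0xf2.
data = b'scrypt\x00\r\x00\xf2\xcc'

try:
    # This is the step Python 2 performs implicitly when .encode()
    # is called on a byte string: decode with the ASCII codec first.
    data.decode('ascii')
    error = None
except UnicodeDecodeError as exc:
    # Keep a reference so the details can be inspected afterwards.
    error = exc

# The failure comes from the ASCII *decode*, not from UTF-8 encoding.
print(error.encoding)  # codec that failed
print(error.start)     # index of the first offending byte
print(error.reason)    # why that byte was rejected
```

That is why the traceback names the `ascii` codec and reports a `UnicodeDecodeError` even though the call was `.encode()`.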

You cannot make Unicode out of those bytes because they are not text. Your only option is to use a binary-to-text scheme like base64, which wraps binary data in a manner safe for transport through systems expecting text (and thus not supporting \x00 NUL bytes, \x0a newlines, or other bytes that have special meaning in text streams).

See the binascii library for various binary-to-text schemes available in the Python standard library; base64 is the most widely used of these.
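As a rough sketch of that approach, the round trip below uses the standard `base64` module. The `data` value is again a short hypothetical stand-in for the scrypt output, and the variable names are only illustrative.

```python
import base64

# Hypothetical stand-in for the binary scrypt output.
data = b'scrypt\x00\r\x00\xf2\xcc'

# b64encode returns bytes that contain only ASCII characters...
encoded = base64.b64encode(data)

# ...so decoding them as ASCII is always safe and yields real text
# (a unicode object on Python 2, a str on Python 3).
text = encoded.decode('ascii')

# To get the original bytes back, reverse both steps.
restored = base64.b64decode(text.encode('ascii'))

print(text)
print(restored == data)
```

The text form is roughly a third larger than the binary input, which is the usual price of a binary-to-text encoding.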

Upvotes: 3

Tom Dalton

Reputation: 6190

The general answer is that you cannot: generic binary data may contain byte sequences that are simply not valid UTF-8. However, depending on your application, you could use a binary-to-text encoding such as Base64 to store the data wherever you need to, and then decode it upon retrieval.
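A small illustration of an invalid sequence, assuming a hypothetical three-byte fragment modelled on the bytes around position 20 in the question's error message: 0xce announces a two-byte UTF-8 sequence, but 0xf5 is not a legal continuation byte, so the decoder gives up.

```python
# Hypothetical fragment: 0xce starts a two-byte UTF-8 sequence, but the
# next byte (0xf5) is outside the continuation range 0x80-0xbf.
fragment = b'\x01\xce\xf5'

try:
    fragment.decode('utf-8')
    valid = True
    reason = None
except UnicodeDecodeError as exc:
    valid = False
    reason = exc.reason

print(valid)
print(reason)
```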

Refs: https://en.wikipedia.org/wiki/Base64

https://docs.python.org/2/library/base64.html

Upvotes: 1
