Reputation: 1095
So, in Python 2.7 I have a string:
Python 2.7.8 (default, Apr 15 2015, 09:26:43)
[GCC 4.9.2 20150212 (Red Hat 4.9.2-6)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import scrypt
>>> s=scrypt.encrypt('somestring', 'test'.encode('ascii'), 0.1)
>>> s
'scrypt\x00\r\x00\x00\x00\x08\x00\x00\x00\x016 \xf2\xcc\xf9\xd2\xbe\xd4\xdbU!\xaf\xecKk{\x8b\r\x94\xe8\x11\xf2\x00\x1f\xd9\xceBhf$cM\x12{\xd8\x84\\\xf2j`\xba\xc5Xk\x196)\xf5\xd3\xe9\x15\xdd\xd3\xa0A_K\x00\x18\x03J\x85\xee\n\xcc\xea\x86\xda\xaa\xfd6E\xf4\x804\xfe\x04\xca\xec!\x94F\x84)B\tf\x07\xd9!@B,\x9e\xffc\xf2\xb6e\x8c\xa9HA\x98\x99\xa0\xe8\xcf\x85P2\x13\x0f\xa1\xf6\x90nO\x85Z\xb2\xc1'
>>> type(s)
<type 'str'>
(It's real ugly.)
I need to encode it into text, either a unicode object or a utf-8 string, because passing it along as-is fails with:
TypeError: You are required to pass either a unicode object or a utf-8 string here.
You passed a Python string object which contained non-utf-8:
'scrypt\x00\r\x00\x00\x00\x08\x00\x00\x00\x01\xce\xf5\xba\x19\xeb1z/5*`m\xec\xf6sgT4\xb5.\xf7^\x96\xfaMY6\xa0\xdb\t\xa3*<5A<\xfb\xbe\xfb>w\xa3,MjaX;\xc1r\xdc\xbd\x04W\xafq3O\x90\x19!\x13\xe8\x0c\x86\xf5\xc96\xf4K\x16\xe3^.v\x8a\xe0\xda\xdd>#\xa7\\\x1c\xc2\x11\x85\x01\xb5\xd4\x92\xef\xa1k\x05Z\xaey\xd7M`%5.\x9f\xb1\xc4\x11N\xdeY\xa2\xac=\r\n\xb4aM\xfd)\xcc$\xbbq\xaa\xfd\x9d \xa5\xd39|\x85\xc8\x95\xbc\xfa\x17\xa1\x8e\xb8\x81 \xb4\x9b>j'.
The UnicodeDecodeError that resulted from attempting to interpret it as utf-8 was:
'utf8' codec can't decode byte 0xce in position 20: invalid continuation byte
The problem is, it's outside of the range of UTF-8:
>>> s.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf2 in position 18: ordinal not in range(128)
So: how should I go about encoding this string?
Bonus points if you can tell me why the ascii codec is the one having an error there (and a UnicodeDecodeError of all things) when I'm trying to encode a string.
(For the record, trying to encode as UTF-16 throws the exact same error.)
I've gotten it to work with base64 (which is, I suppose, what that's for) but I'm curious as to why I'm getting this error and what my options are.
Upvotes: 2
Views: 7612
Reputation: 1123042
You have binary data. Not text, and certainly not Unicode. You cannot encode this to UTF-8 as it is not a unicode (text) object.
Your UnicodeDecodeError is caused by Python trying to decode the data first; it is trying to be helpful because normally you can only encode from Unicode to bytes. Since you tried to do this on bytes instead, it first needs to decode the bytes to Unicode, and it'll do that using the ASCII codec. But you don't have ASCII data, nor any other text encoding.
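To make that hidden step visible, here is a minimal sketch that writes the two stages out by hand. The short `raw` value is a made-up stand-in for the scrypt output, not the real string from the question, and the explicit `.decode('ascii')` is my spelling of what Python 2 does implicitly when you call `.encode()` on a `str`.

```python
# Made-up stand-in for the scrypt output: mostly ASCII plus two high bytes.
raw = b'scrypt\x00\r\xf2\xcc'

error = None
try:
    # Stage 1: decode the bytes with the ASCII codec (implicit in Python 2).
    # Stage 2: encode the resulting text to UTF-8 (never reached here).
    raw.decode('ascii').encode('utf-8')
except UnicodeDecodeError as exc:
    error = exc  # keep a reference; the decode stage is what failed

print('%s from the %r codec at byte offset %d'
      % (type(error).__name__, error.encoding, error.start))
```

The failure comes from the decode stage, which is why the traceback names the ascii codec and a UnicodeDecodeError even though the call was `.encode()`. The target encoding never gets a chance to matter, which would also explain why UTF-16 produces the same error.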
You cannot make Unicode out of those bytes because they are not text. Your only option is to use a binary-to-text scheme like base64, which wraps binary data in a manner safe for transport through systems expecting text (and thus not supporting \x00 NUL bytes or \x0a newlines or other bytes that have special meaning in text streams).
See the binascii library for various binary-to-text schemes available in the Python standard library; base64 is the most widely used of these.
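As a rough illustration of two of those schemes, the sketch below round-trips some bytes through base64 and hex using binascii. The `raw` value is again a shortened, made-up stand-in for the scrypt output, and the variable names are only for illustration.

```python
import binascii

# Made-up stand-in for the binary scrypt output.
raw = b'scrypt\x00\r\x00\x00\x00\x08\xf2\xcc\xf9'

# base64: b2a_base64 appends a trailing newline, so strip it off.
b64_text = binascii.b2a_base64(raw).rstrip(b'\n').decode('ascii')

# hex: two ASCII characters per byte, larger than base64 but easy to read.
hex_text = binascii.hexlify(raw).decode('ascii')

# Both results are plain ASCII text and can be turned back into the bytes.
from_b64 = binascii.a2b_base64(b64_text.encode('ascii'))
from_hex = binascii.unhexlify(hex_text.encode('ascii'))

print(from_b64 == raw and from_hex == raw)
```

Either form is safe to hand to code that insists on a unicode object or a UTF-8 string, since ASCII is a subset of UTF-8.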
Upvotes: 3
Reputation: 6190
The general answer is that you cannot: arbitrary binary data may contain byte sequences that are simply not valid UTF-8. However, depending on your application, you could use a binary-to-text encoding such as Base64 to store the data wherever you need to, and then decode it upon retrieval.
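A small sketch of that store-then-decode pattern, under some assumptions: `text_store` is a hypothetical stand-in for whatever text-only storage you have (a plain dict here), the `'token'` key is invented, and `raw` is a shortened fake of the scrypt output.

```python
import base64

# Made-up stand-in for the scrypt output; \xce followed by \xf5 is not valid UTF-8.
raw = b'scrypt\x00\r\xce\xf5\xba'

# Confirm that the bytes really cannot be read as UTF-8 text.
try:
    raw.decode('utf-8')
    is_valid_utf8 = True
except UnicodeDecodeError:
    is_valid_utf8 = False

# Hypothetical text-only storage, modelled as a dict for this sketch.
text_store = {}
text_store['token'] = base64.b64encode(raw).decode('ascii')

# On retrieval, reverse the encoding to get the original bytes back.
restored = base64.b64decode(text_store['token'].encode('ascii'))

print(is_valid_utf8, restored == raw)
```

The cost is size: Base64 output is roughly a third larger than the input, which is usually acceptable for short values like this one.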
Refs: https://en.wikipedia.org/wiki/Base64
https://docs.python.org/2/library/base64.html
Upvotes: 1