Reputation: 1
This is Python 2.
My understanding is:
decode converts bytes into anything (ASCII/code points/UTF-8/whatever),
encode converts Unicode code points into anything (bytes/ASCII/UTF-8/...).
Given the code below:
>>> myUtf8
'Hi \xe2\x84\x99\xc6\xb4\xe2\x98\x82\xe2\x84\x8c\xc3\xb8\xe1\xbc\xa4'
>>> myUtf8.decode("ascii", "replace")
u'Hi \ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd'
>>> myUtf8.decode('utf-16')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/encodings/utf_16.py", line 16, in decode
    return codecs.utf_16_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode byte 0xa4 in position 18: truncated data
Question:
Why is the code point \ufffd substituted for the bytes that can't be decoded?
Edit:
Question:
Suppose bytes are being received, say from the network. How can I find out which encoding those bytes are expressed in?
Upvotes: 0
Views: 655
Reputation: 126777
Your understanding is wrong. decode is "from bytes expressed in the encoding you pass as first parameter to unicode"; encode is "from unicode to bytes expressed in the encoding you pass as first parameter".
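To make the two directions concrete, here is a minimal sketch that should run on both Python 2.7 and Python 3. The variable names (text, as_utf8, as_latin1, mismatched) are made up for illustration and are not from the question.

```python
# One unicode code point: U+00F8 (Latin small letter o with stroke).
text = u'\xf8'

# encode: unicode text -> bytes expressed in the encoding you name.
as_utf8 = text.encode('utf-8')
as_latin1 = text.encode('latin-1')
assert as_utf8 == b'\xc3\xb8'    # two bytes in UTF-8
assert as_latin1 == b'\xf8'      # a single byte in Latin-1

# decode: bytes expressed in the encoding you name -> unicode text.
assert as_utf8.decode('utf-8') == text
assert as_latin1.decode('latin-1') == text

# Naming the wrong encoding does not always fail; it can silently
# produce different text (here two characters instead of one).
mismatched = as_utf8.decode('latin-1')
assert mismatched == u'\xc3\xb8'
assert len(mismatched) == 2
```

The same code point produces different byte sequences under different encodings, which is why the encoding name has to be passed on every call.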
In your example, you give Python some bytes expressed in UTF-8 and tell it to interpret them as ASCII in order to build a unicode string. Since no byte above 127 is valid ASCII, those bytes are considered garbage and, as you requested with the "replace" parameter, each one is replaced with the Unicode replacement character (U+FFFD).
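As a rough illustration of the error handlers, the sketch below decodes the bytes from the question in several ways. It should run on both Python 2.7 and Python 3; the names my_utf8, strict_failed, replaced, ignored and decoded are chosen here for the example.

```python
# The UTF-8 bytes from the question: 'Hi ' plus 16 non-ASCII bytes.
my_utf8 = b'Hi \xe2\x84\x99\xc6\xb4\xe2\x98\x82\xe2\x84\x8c\xc3\xb8\xe1\xbc\xa4'

# Default "strict" handler: the first byte above 127 raises an error.
try:
    my_utf8.decode('ascii')
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# "replace": every undecodable byte becomes U+FFFD.
replaced = my_utf8.decode('ascii', 'replace')

# "ignore": undecodable bytes are silently dropped.
ignored = my_utf8.decode('ascii', 'ignore')

# Naming the encoding the bytes are really in recovers the text.
decoded = my_utf8.decode('utf-8')

assert strict_failed
assert replaced.count(u'\ufffd') == 16   # one per non-ASCII byte
assert ignored == u'Hi '
assert len(decoded) == 9                 # 'Hi ' plus six code points
assert decoded.encode('utf-8') == my_utf8
```

Both "replace" and "ignore" lose information, so they only make sense when the real encoding is unknown and some damage is acceptable.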
Upvotes: 2