overexchange

Reputation: 1

Decode - From bytes to any

This is Python 2

My understanding is:

decode converts from bytes to anything (ASCII, code points, UTF-8, whatever).

encode converts from Unicode code points to anything (bytes, ASCII, UTF-8, ...).

Given the code below:

>>> myUtf8
'Hi \xe2\x84\x99\xc6\xb4\xe2\x98\x82\xe2\x84\x8c\xc3\xb8\xe1\xbc\xa4'
>>> myUtf8.decode("ascii", "replace")
u'Hi \ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd'
>>> myUtf8.decode('utf-16')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/encodings/utf_16.py", line 16, in decode
    return codecs.utf_16_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode byte 0xa4 in position 18: truncated data

Question:

Why is the code point \ufffd substituted for the bytes that can't be decoded?

Edit:

Question:

Say the agreement is that bytes are being received, for example from the network. How do I find out which encoding those bytes are expressed in?

Upvotes: 0

Views: 655

Answers (1)

Matteo Italia

Reputation: 126777

Your understanding is wrong. decode goes "from bytes, expressed in the encoding you pass as the first parameter, to unicode"; encode goes "from unicode to bytes, expressed in the encoding you pass as the first parameter".
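As a rough illustration of that direction rule, here is a minimal sketch. The variable names are made up for this example, and the `b'...'`/`u'...'` literals are used only so that the snippet behaves the same on Python 2.7 and Python 3.

```python
# One Unicode code point: U+00F8 (LATIN SMALL LETTER O WITH STROKE).
text = u'\xf8'

# encode: unicode -> bytes, in the encoding named by the first parameter.
utf8_bytes = text.encode('utf-8')      # two bytes: 0xC3 0xB8
latin1_bytes = text.encode('latin-1')  # one byte: 0xF8

# decode: bytes -> unicode, interpreting the bytes in the named encoding.
from_utf8 = utf8_bytes.decode('utf-8')
from_latin1 = latin1_bytes.decode('latin-1')

# Decoding with the wrong codec does not fail here, but it yields a
# different string: each UTF-8 byte is read as its own Latin-1 character.
wrong_codec = utf8_bytes.decode('latin-1')
```

Both round trips return the original string, while `wrong_codec` is a two-character string, which is why the encoding name passed to decode has to match how the bytes were actually produced.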

In your example, you give some bytes expressed in UTF-8 and tell Python to interpret them as ASCII and build a unicode string from them. Since no byte above 127 is valid ASCII, each such byte is considered garbage, and so, as you requested with the "replace" parameter, each one is replaced with the Unicode replacement character (U+FFFD, shown as \ufffd).
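The following sketch reproduces this with the bytes from the question. The names `my_utf8`, `as_ascii`, `as_utf8` and `strict_failed` are chosen for this example only; the `b'...'` literal keeps it runnable on both Python 2.7 and Python 3.

```python
# The same 19 bytes as in the question: "Hi " followed by 16 UTF-8 bytes.
my_utf8 = b'Hi \xe2\x84\x99\xc6\xb4\xe2\x98\x82\xe2\x84\x8c\xc3\xb8\xe1\xbc\xa4'

# Wrong codec, errors="replace": every byte above 127 becomes U+FFFD.
as_ascii = my_utf8.decode('ascii', 'replace')

# Wrong codec with the default errors="strict" raises instead of replacing.
try:
    my_utf8.decode('ascii')
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Right codec: the 16 non-ASCII bytes collapse into 6 code points.
as_utf8 = my_utf8.decode('utf-8')
```

With the matching codec the result is 9 characters long (3 ASCII characters plus 6 code points), whereas the ASCII decode with "replace" keeps one replacement character per undecodable byte, which gives the 16 \ufffd entries seen in the question.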

Upvotes: 2
