encode and decode for a specific character set

Question

The printed results look the same, so what is the purpose of encoding and decoding with UTF-8? And is it encode('utf8') or encode('utf-8')?

u = 'abc'
print(u)
u = u.encode('utf-8')
print(u)
uu = u.decode('utf-8')
print(uu)

Nick T · Accepted Answer

str.encode encodes the string (or unicode string) into a series of bytes. In Python 3 the result is a bytes object; in Python 2 it's str again (confusingly). When you encode a unicode string, you are left with bytes, not unicode. Remember that UTF-8 is not Unicode: it's an encoding method that turns Unicode code points into bytes.

decode (str.decode in Python 2, bytes.decode in Python 3) decodes the serialized byte stream with the selected codec, mapping the bytes back to the proper Unicode code points and giving you a unicode string.
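With plain ASCII like 'abc' the bytes happen to look the same as the text, which hides what is going on. A minimal Python 3 sketch with a non-ASCII character makes the difference visible; the sample word and the variable names are just illustrative assumptions, not part of the question:

```python
# Python 3 sketch; the sample text is an illustrative assumption.
text = 'caf\u00e9'                     # 4 code points, the last is U+00E9

utf8_bytes = text.encode('utf-8')      # U+00E9 becomes 2 bytes: \xc3\xa9
latin1_bytes = text.encode('latin-1')  # U+00E9 becomes 1 byte:  \xe9

print(len(text), len(utf8_bytes), len(latin1_bytes))  # 4 5 4
print(utf8_bytes)    # b'caf\xc3\xa9'
print(latin1_bytes)  # b'caf\xe9'

# Decoding with the codec that produced the bytes restores the text.
roundtrip = utf8_bytes.decode('utf-8')

# Decoding with a different codec succeeds here but gives the wrong text.
wrong = utf8_bytes.decode('latin-1')
print(roundtrip == text, wrong == text)  # True False
```

The same four code points produce different byte sequences under different codecs, which is why the codec has to be named on both the encode and the decode side.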

So in Python 2 your code does 'abc' > 'abc' > u'abc', and in Python 3 it does 'abc' > b'abc' > 'abc'. Try also printing repr(u) or type(u) at each step to see what changes where.
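As a rough illustration of that suggestion, here is the question's round trip under Python 3 with type and repr printed at each step. The intermediate name b is an assumption added so that each value keeps its own name:

```python
# Python 3: inspect the value at each step of the round trip.
u = 'abc'
print(type(u), repr(u))    # <class 'str'> 'abc'

b = u.encode('utf-8')      # text -> bytes
print(type(b), repr(b))    # <class 'bytes'> b'abc'

uu = b.decode('utf-8')     # bytes -> text
print(type(uu), repr(uu))  # <class 'str'> 'abc'
```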

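As for the spelling: utf_8 might be the most canonical, but it doesn't really matter. Python treats 'utf8', 'utf-8', 'UTF-8' and 'utf_8' as names for the same codec.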
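One way to check this yourself, sketched for Python 3, is to ask the standard codecs module which codec each spelling resolves to. The list of spellings is only a sample:

```python
import codecs

# A few common spellings; each should resolve to the same codec.
spellings = ['utf8', 'utf-8', 'UTF-8', 'utf_8']
names = {codecs.lookup(s).name for s in spellings}
print(names)  # {'utf-8'}

# The encoded bytes are therefore identical whichever spelling is used.
same = 'abc'.encode('utf8') == 'abc'.encode('utf-8')
print(same)   # True
```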
