Wooyoung Cho

Reputation: 33

Why is an empty string '' encoded into 2 bytes in utf-16 but 0 bytes in utf-8 or ascii?

I was just learning about encoding strings in Python and, after experimenting a little, got confused by the fact that the encoded size of an empty string ('') is 0 bytes in UTF-8 and ASCII but 2 bytes in UTF-16. Why is that?

print(len(''.encode('utf16'))) # is 2
print(len(''.encode('utf8'))) # is 0

I guess a big part of the problem is that I don't understand how UTF-16 works. I don't understand why encoding 'spam' in UTF-16 gives 10 bytes instead of just 8 (2 bytes, i.e. 16 bits, per character). Are the extra 2 bytes added to every UTF-16 string by default, as padding or something similar?

Edit:

I am NOT confused about the basics of how UTF-8 and UTF-16 work or how they differ in storing individual characters. I am confused about why the absence of any characters (an empty string) takes 2 bytes in UTF-16 but 0 bytes in UTF-8 (as opposed to 1 byte, or 0 bytes, for both).

The link does not answer my question.

Upvotes: 2

Views: 2328

Answers (1)

dan04

Reputation: 91189

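By default, Python includes a Byte Order Mark (BOM) when encoding to UTF-16, but not when encoding to UTF-8. The UTF-16 BOM is 2 bytes long, which is why the empty string encodes to 2 bytes and 'spam' to 10 bytes rather than 8.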

>>> ''.encode('utf16')
b'\xff\xfe'
>>> ''.encode('utf8')
b''
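
To tie this back to the 'spam' case in the question, here is a small sketch that splits the UTF-16 output into its BOM and its payload. The variable names are only illustrative. The exact BOM bytes depend on the machine's native byte order, so the check accepts either form rather than assuming `b'\xff\xfe'`.

```python
import codecs

encoded = 'spam'.encode('utf-16')

# The first 2 bytes are the BOM; the rest is the actual text.
bom = encoded[:2]
payload = encoded[2:]

# FF FE on little-endian machines, FE FF on big-endian ones.
assert bom in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# 2 bytes of BOM + 4 characters * 2 bytes each = 10 bytes.
print(len(bom), len(payload), len(encoded))  # 2 8 10

# The empty string has no payload, so only the BOM remains.
empty = ''.encode('utf-16')
print(len(empty))  # 2
```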

You can suppress the BOM by explicitly specifying the byte order with a BE (Big-Endian) or LE (Little-Endian) suffix.

>>> ''.encode('utf-16-le')
b''
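
As a rough illustration of the suffix behaviour, the sketch below encodes 'spam' with both explicit byte orders and checks that no BOM is added. It also shows `utf-8-sig`, a separate Python codec that does write a BOM for UTF-8; it is mentioned here only for comparison and is not what the plain `utf8` codec does.

```python
text = 'spam'

# Explicit byte order: no BOM, exactly 2 bytes per character here.
le = text.encode('utf-16-le')
be = text.encode('utf-16-be')
print(le)  # b's\x00p\x00a\x00m\x00'
print(be)  # b'\x00s\x00p\x00a\x00m'
assert len(le) == len(be) == 8

# The generic 'utf-16' codec reads the BOM on decode and strips it.
round_trip = text.encode('utf-16').decode('utf-16')
assert round_trip == text

# Without a BOM, the byte order has to be named when decoding.
assert be.decode('utf-16-be') == text
assert le.decode('utf-16-le') == text

# For comparison: 'utf-8-sig' prepends a 3-byte BOM, plain 'utf-8' does not.
sig_empty = ''.encode('utf-8-sig')
plain_empty = ''.encode('utf-8')
print(len(sig_empty), len(plain_empty))  # 3 0
```

The trade-off is that bytes produced with an explicit suffix carry no marker, so whoever decodes them needs to know the byte order some other way.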

Upvotes: 5
