Reputation: 33
I was just learning about encoding strings in Python, and after fiddling with it a little I got confused: the encoded size of an empty string ('') is 0 bytes in UTF-8 and ASCII but somehow 2 bytes in UTF-16. How come?
print(len(''.encode('utf16'))) # is 2
print(len(''.encode('utf8'))) # is 0
I guess a big part of the problem is that I don't understand how UTF-16 works. I don't understand why encoding 'spam' in UTF-16 gives 10 bytes instead of just 8 (2 bytes, i.e. 16 bits, per character). Are the extra 2 bytes something UTF-16 adds to every string by default, for padding or something?
*edit
I am NOT confused about the basics of how UTF-8 and UTF-16 work or how they differ in storing individual characters. I am confused about why the absence of any characters (an empty string) takes 2 bytes in UTF-16 but 0 bytes in UTF-8 (as opposed to 1 byte or 0 bytes for both).
The linked question does not answer my question.
Upvotes: 2
Views: 2328
Reputation: 91189
By default, Python includes a Byte Order Mark (BOM) when encoding to UTF-16, but not when encoding to UTF-8. The BOM is 2 bytes long, which accounts for the extra 2 bytes.
>>> ''.encode('utf16')
b'\xff\xfe'
>>> ''.encode('utf8')
b''
You can suppress the BOM by explicitly specifying the byte order with a BE (big-endian) or LE (little-endian) suffix.
>>> ''.encode('utf-16-le')
b''
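The same accounting should explain the 10 bytes for 'spam' from the question. Here is a minimal sketch, assuming CPython's standard codecs; the variable names are only for illustration. It compares the plain 'utf-16' codec with 'utf-16-le' and checks the prefix against `codecs.BOM_UTF16`, the BOM in the platform's native byte order.

```python
import codecs

text = 'spam'

# Plain 'utf-16' uses the machine's native byte order and prepends a BOM.
with_bom = text.encode('utf-16')
# An explicit byte order ('-le' or '-be') writes no BOM.
without_bom = text.encode('utf-16-le')

print(len(with_bom))     # 10: 2-byte BOM + 4 characters * 2 bytes each
print(len(without_bom))  # 8: 4 characters * 2 bytes each

# The leading 2 bytes are the native-order BOM, not padding.
print(with_bom.startswith(codecs.BOM_UTF16))  # True

# Decoding with plain 'utf-16' consumes the BOM again.
print(with_bom.decode('utf-16') == text)  # True
```

So the 2 bytes are a fixed prefix per encoded string, not a per-character cost, which is why the empty string shows them too.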
Upvotes: 5