Reputation:
I'm a PHP Developer by profession.
Consider the example below:
I want to encode the word "hello" using UTF-8 encoding.
So, the equivalent code points of each of the letters of the word "hello" are as below:
h = 104
e = 101
l = 108
o = 111
So, we can say that this list of decimal numbers represents the string "hello":
104 101 108 108 111
UTF-8 encoding will store "hello" like this (binary):
01101000 01100101 01101100 01101100 01101111
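As a PHP developer I can verify this with a quick sketch (just an illustration, using only the standard str_split(), ord() and printf() functions):

```php
<?php
// Print each byte of the UTF-8 string "hello" as 8 binary digits.
// PHP string literals hold raw bytes; this source file is UTF-8.
$string = "hello";

foreach (str_split($string) as $byte) {
    // ord() gives the byte value (0-255); %08b pads it to 8 binary digits.
    printf("%08b ", ord($byte));
}
// Output: 01101000 01100101 01101100 01101100 01101111
```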
If you observe the binary encoded values above closely, you will notice that the binary equivalent of every decimal number is preceded by the bit value 0.
My question is: why is this initial 0 prefixed to every stored character? What is its purpose in UTF-8 encoding?
What happens when the same string is encoded using UTF-16?
If it is necessary, could the initial extra bit be a 1 instead?
Does NUL Byte mean the binary character 0?
Upvotes: 1
Views: 2480
Reputation: 111860
UTF-8 encodes Unicode codepoints U+0000 - U+007F (which are the ASCII characters 0-127) using 7 bits. The eighth (high) bit is used to signal that additional bytes are necessary, which is only the case when encoding Unicode codepoints U+0080 - U+10FFFF.
For example, è is codepoint U+00E8, which is encoded in UTF-8 as bytes 0xC3 0xA8 (11000011 10101000 in binary).
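You can inspect those bytes from PHP with a small sketch (assuming PHP 7+ for the \u{...} escape; the output is shown as a comment):

```php
<?php
// Inspect the UTF-8 bytes of "è" (codepoint U+00E8).
$char = "\u{00E8}"; // PHP 7+ codepoint escape; the literal is stored as UTF-8

foreach (str_split($char) as $byte) {
    // Show each byte in hex and as 8 binary digits.
    printf("0x%02X (%08b) ", ord($byte), ord($byte));
}
// Output: 0xC3 (11000011) 0xA8 (10101000)
```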
Wikipedia explains quite well how UTF-8 is encoded.
Does NUL Byte mean the binary character 0?
Yes.
Upvotes: 1
Reputation: 522091
UTF-8 is backwards compatible with ASCII. ASCII uses the values 0 - 127 and has assigned characters to them. That means bytes 0000 0000 through 0111 1111. UTF-8 keeps that same mapping for those same first 128 characters.
Any character not found in ASCII is encoded in the form of 1xxx xxxx in UTF-8, i.e. for any non-ASCII character the high bit of every encoded byte is 1. Those characters are encoded in multiple bytes in UTF-8. The first bits of the first byte in the sequence tell the decoder how many bytes the character consists of: 110x xxxx signals a 2-byte character, 1110 xxxx a 3-byte character and 1111 0xxx a 4-byte character. Subsequent bytes in the sequence are in the form 10xx xxxx. So, no, you can't just set the initial bit to 1 arbitrarily.
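To make those leading-bit patterns concrete, here is a hypothetical helper (the name utf8SequenceLength is mine, not a built-in; assumes PHP 7+ for scalar type hints and 0b binary literals) that classifies a first byte by masking its top bits:

```php
<?php
// Determine a UTF-8 sequence's length from its leading byte,
// using the bit patterns described above.
function utf8SequenceLength(int $firstByte): int
{
    if (($firstByte & 0b10000000) === 0b00000000) return 1; // 0xxx xxxx: ASCII
    if (($firstByte & 0b11100000) === 0b11000000) return 2; // 110x xxxx
    if (($firstByte & 0b11110000) === 0b11100000) return 3; // 1110 xxxx
    if (($firstByte & 0b11111000) === 0b11110000) return 4; // 1111 0xxx
    // 10xx xxxx is a continuation byte and never starts a sequence.
    throw new InvalidArgumentException('Not a valid UTF-8 leading byte');
}

echo utf8SequenceLength(0x68), "\n"; // 1 ("h" is ASCII)
echo utf8SequenceLength(0xC3), "\n"; // 2 ("è" starts with 0xC3)
echo utf8SequenceLength(0xE2), "\n"; // 3 ("€" starts with 0xE2)
```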
There are various extensions to ASCII (e.g. ISO-8859) which set that first bit as well and thereby add another 128 characters of the form 1xxx xxxx.
There's also 7-bit ASCII, which omits the first 0 bit and just uses 000 0000 through 111 1111.
Does NUL Byte mean the binary character 0?
It means the bit sequence 0000 0000, i.e. an all-zero byte with the decimal/hex/octal value 0.
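In PHP, the escape "\0" (in a double-quoted string) produces exactly that byte, as this one-line sketch shows:

```php
<?php
// "\0" is the NUL byte: a single byte whose eight bits are all 0.
printf("%d (%08b)\n", ord("\0"), ord("\0")); // Output: 0 (00000000)
```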
You may be interested in What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.
Upvotes: 5