Reputation:
I'm a PHP Developer by profession.
Consider the example below:
I want to encode the word "hello" using UTF-8 encoding.
So, the equivalent code points of each of the letters of the word "hello" are as below:
h = 104
e = 101
l = 108
o = 111
So, we can say that this list of decimal numbers represents the string "hello":
104 101 108 108 111
UTF-8 encoding will store "hello" like this (binary):
01101000 01100101 01101100 01101100 01101111
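As a PHP developer I can verify this with a quick sketch (just an illustration, using only the standard str_split(), ord() and printf() functions):

```php
<?php
// Print each byte of the UTF-8 string "hello" as 8 binary digits.
// PHP string literals hold raw bytes; this source file is UTF-8.
$string = "hello";

foreach (str_split($string) as $byte) {
    // ord() gives the byte value (0-255); %08b pads it to 8 binary digits.
    printf("%08b ", ord($byte));
}
// Output: 01101000 01100101 01101100 01101100 01101111
```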
If you observe the binary encoded values above closely, you will notice that the binary equivalent of every decimal number is preceded by the bit value 0.
My question is: why is this initial 0 prefixed to every stored character? What is its purpose in UTF-8 encoding?
What happens when the same string is encoded using UTF-16?
If it is necessary, could the initial extra bit be a 1 instead?
Does NUL Byte mean the binary character 0?
Upvotes: 1
Views: 2480
Reputation: 111860
UTF-8 encodes Unicode codepoints U+0000 - U+007F (which are the ASCII characters 0-127) using 7 bits. The eighth (high) bit is used to signal that additional bytes are necessary, which is only the case when encoding Unicode codepoints U+0080 - U+10FFFF.
For example, è is codepoint U+00E8, which is encoded in UTF-8 as bytes 0xC3 0xA8 (11000011 10101000 in binary).
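You can inspect those bytes from PHP with a small sketch (assuming PHP 7+ for the \u{...} escape; the output is shown as a comment):

```php
<?php
// Inspect the UTF-8 bytes of "è" (codepoint U+00E8).
$char = "\u{00E8}"; // PHP 7+ codepoint escape; the literal is stored as UTF-8

foreach (str_split($char) as $byte) {
    // Show each byte in hex and as 8 binary digits.
    printf("0x%02X (%08b) ", ord($byte), ord($byte));
}
// Output: 0xC3 (11000011) 0xA8 (10101000)
```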
Wikipedia explains quite well how UTF-8 is encoded.
Does NUL Byte mean the binary character 0?
Yes.
Upvotes: 1
Reputation: 522091
UTF-8 is backwards compatible with ASCII. ASCII uses the values 0 - 127 and has assigned characters to them. That means bytes 0000 0000 through 0111 1111. UTF-8 keeps that same mapping for those same first 128 characters.
Any character not found in ASCII is encoded in the form of 1xxx xxxx in UTF-8, i.e. for any non-ASCII character the high bit of every encoded byte is 1. Those characters are encoded in multiple bytes in UTF-8. The first bits of the first byte in the sequence tell the decoder how many bytes the character consists of: 110x xxxx signals a 2-byte character, 1110 xxxx a 3-byte character and 1111 0xxx a 4-byte character. Subsequent bytes in the sequence are in the form 10xx xxxx. So, no, you can't just set the initial bit to 1 arbitrarily.
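To make those leading-bit patterns concrete, here is a hypothetical helper (the name utf8SequenceLength is mine, not a built-in; assumes PHP 7+ for scalar type hints and 0b binary literals) that classifies a first byte by masking its top bits:

```php
<?php
// Determine a UTF-8 sequence's length from its leading byte,
// using the bit patterns described above.
function utf8SequenceLength(int $firstByte): int
{
    if (($firstByte & 0b10000000) === 0b00000000) return 1; // 0xxx xxxx: ASCII
    if (($firstByte & 0b11100000) === 0b11000000) return 2; // 110x xxxx
    if (($firstByte & 0b11110000) === 0b11100000) return 3; // 1110 xxxx
    if (($firstByte & 0b11111000) === 0b11110000) return 4; // 1111 0xxx
    // 10xx xxxx is a continuation byte and never starts a sequence.
    throw new InvalidArgumentException('Not a valid UTF-8 leading byte');
}

echo utf8SequenceLength(0x68), "\n"; // 1 ("h" is ASCII)
echo utf8SequenceLength(0xC3), "\n"; // 2 ("è" starts with 0xC3)
echo utf8SequenceLength(0xE2), "\n"; // 3 ("€" starts with 0xE2)
```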
There are various extensions to ASCII (e.g. ISO-8859) which set that first bit as well and thereby add another 128 characters of the form 1xxx xxxx.
There's also 7-bit ASCII, which omits the first 0 bit and just uses 000 0000 through 111 1111.
Does NUL Byte mean the binary character 0?
It means the bit sequence 0000 0000, i.e. an all-zero byte with the decimal/hex/octal value 0.
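In PHP, the escape "\0" (in a double-quoted string) produces exactly that byte, as this one-line sketch shows:

```php
<?php
// "\0" is the NUL byte: a single byte whose eight bits are all 0.
printf("%d (%08b)\n", ord("\0"), ord("\0")); // Output: 0 (00000000)
```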
You may be interested in What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.
Upvotes: 5