Reputation: 12748
We are upgrading our database to 11g and also converting everything to Unicode. After reading online, I found out that each character in a string can take 1, 2 or 4 bytes.
I was wondering how the system can know how many bytes a character takes. Is there a reserved bit in each byte of the Unicode encoding that says "this character is 2 bytes"?
Upvotes: 2
Views: 1085
Reputation: 201866
A Unicode character as such is an abstract concept. When characters are encoded as byte strings, they may have different lengths. In UTF-32, each character is 4 bytes. In UTF-16, each character is 2 or 4 bytes. In UTF-8, each character is 1, 2, 3, or 4 bytes.
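To make that concrete, here is a small Python sketch (Python purely for illustration, nothing Oracle-specific) that prints the byte length of a few characters in each of the three encodings. The "-be" codec variants are used so the 2-byte byte-order mark doesn't skew the counts:

    # Byte lengths of the same characters in UTF-8, UTF-16 and UTF-32.
    for ch in ("A", "é", "€", "😀"):
        print(ch,
              len(ch.encode("utf-8")),
              len(ch.encode("utf-16-be")),
              len(ch.encode("utf-32-be")))
    # A: 1 2 4   é: 2 2 4   €: 3 2 4   😀: 4 4 4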
In UTF-16, the first two bytes (the first 16-bit code unit) determine whether two more bytes follow. The additional bytes are present if that code unit falls in the range 0xD800 to 0xDBFF, the designated "high surrogates" range.
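A quick sketch of that check in Python, assuming big-endian UTF-16 without a BOM (the helper name utf16_units is just for illustration):

    # The lead 16-bit code unit of a surrogate pair falls in
    # 0xD800-0xDBFF; a unit outside the surrogate range stands alone.
    def utf16_units(lead_unit):
        return 2 if 0xD800 <= lead_unit <= 0xDBFF else 1

    data = "\U0001D11E".encode("utf-16-be")   # U+1D11E MUSICAL SYMBOL G CLEF
    lead = int.from_bytes(data[:2], "big")
    print(hex(lead), utf16_units(lead))       # 0xd834 2 -> two more bytes follow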
In UTF-8, the bit pattern of the first byte specifies how many bytes there are for the character. If the most significant bit is 0, there is just this one byte (so ASCII characters are represented just as in ASCII). If the first three bits are 110, there is one more byte. If the first four bits are 1110, two more bytes, and if the first five bits are 11110, three more bytes.
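Here is a minimal Python sketch of that lead-byte test (the helper utf8_seq_len is hypothetical, just to show the bit patterns in action):

    # Sequence length from the lead byte's high bits.
    def utf8_seq_len(lead):
        if lead >> 7 == 0b0:          # 0xxxxxxx -> 1 byte (ASCII)
            return 1
        if lead >> 5 == 0b110:        # 110xxxxx -> 2 bytes
            return 2
        if lead >> 4 == 0b1110:       # 1110xxxx -> 3 bytes
            return 3
        if lead >> 3 == 0b11110:      # 11110xxx -> 4 bytes
            return 4
        raise ValueError("continuation byte, not a start byte")

    for ch in ("A", "é", "€", "😀"):
        encoded = ch.encode("utf-8")
        print(ch, utf8_seq_len(encoded[0]) == len(encoded))   # True for all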
If you pick up an arbitrary byte from a UTF-8 stream, you cannot generally decide whether it is part of a 2-, 3-, or 4-byte representation. If it matches one of the start-byte patterns described above, you know what it is. But if it starts with the bits 10, it is a continuation byte, and you cannot tell the length of the sequence it belongs to.
This means that a UTF-8 stream must be processed sequentially. Direct addressing by character position is impossible; to find the Nth character, you need to start reading from the beginning and observe the bit patterns of start bytes.
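A sketch of that sequential scan in Python (nth_char_offset is a made-up helper for illustration): it counts start bytes, i.e. any byte that is not a continuation byte of the form 10xxxxxx.

    # Byte offset of the Nth character in a UTF-8 buffer.
    def nth_char_offset(buf, n):
        seen = -1
        for i, byte in enumerate(buf):
            if byte & 0xC0 != 0x80:   # start byte: a new character begins
                seen += 1
                if seen == n:
                    return i
        raise IndexError("fewer than n+1 characters")

    buf = "aé€😀z".encode("utf-8")     # 1+2+3+4 bytes, then 'z'
    print(nth_char_offset(buf, 4))    # 10 -> 'z' starts at byte 10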
Upvotes: 1
Reputation: 231851
First, be aware that there are major differences between Unicode and a particular encoding. There are multiple ways to encode Unicode (UTF-8, UTF-16, and UTF-32 being three of the more common), each of which has different properties. You appear to be describing the properties of the UTF-8 encoding.
Yes, the leading bit(s) within each byte of a UTF-8 encoded string indicate how many bytes a particular character uses. The Wikipedia article on the UTF-8 encoding shows the various bit-patterns for each byte for 1, 2, 3, and 4 byte characters.
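You can see those bit patterns directly by printing each byte of an encoded character in binary (a Python sketch, just for illustration):

    # Each byte of the 3-byte character U+20AC (€) in binary:
    for byte in "€".encode("utf-8"):
        print(f"{byte:08b}")
    # 11100010   lead byte 1110xxxx -> 3-byte character
    # 10000010   continuation byte 10xxxxxx
    # 10101100   continuation byte 10xxxxxx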
Upvotes: 3