Reputation: 12748
We are upgrading our database to 11g and also converting everything to Unicode. After reading online, I found out that each character in a string can take 1, 2 or 4 bytes.
I was wondering how the system can know how many bytes a character takes. Is there a reserved bit in each byte of the Unicode encoding that says "this character is 2 bytes"?
Upvotes: 2
Views: 1085
Reputation: 201866
A Unicode character as such is an abstract concept. When characters are encoded as byte strings, they may have different lengths. In UTF-32, each character is 4 bytes. In UTF-16, each character is 2 or 4 bytes. In UTF-8, each character is 1, 2, 3, or 4 bytes.
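To make that concrete, here is a small Python sketch (Python purely for illustration, nothing Oracle-specific) that prints the byte length of a few characters in each of the three encodings. The "-be" codec variants are used so the 2-byte byte-order mark doesn't skew the counts:

    # Byte lengths of the same characters in UTF-8, UTF-16 and UTF-32.
    for ch in ("A", "é", "€", "😀"):
        print(ch,
              len(ch.encode("utf-8")),
              len(ch.encode("utf-16-be")),
              len(ch.encode("utf-32-be")))
    # A: 1 2 4   é: 2 2 4   €: 3 2 4   😀: 4 4 4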
In UTF-16, the first two bytes (the first 16-bit code unit) determine whether two more bytes follow. The additional bytes are present if that code unit falls in the range 0xD800 to 0xDBFF, the designated "high surrogates" range.
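A quick sketch of that check in Python, assuming big-endian UTF-16 without a BOM (the helper name utf16_units is just for illustration):

    # The lead 16-bit code unit of a surrogate pair falls in
    # 0xD800-0xDBFF; a unit outside the surrogate range stands alone.
    def utf16_units(lead_unit):
        return 2 if 0xD800 <= lead_unit <= 0xDBFF else 1

    data = "\U0001D11E".encode("utf-16-be")   # U+1D11E MUSICAL SYMBOL G CLEF
    lead = int.from_bytes(data[:2], "big")
    print(hex(lead), utf16_units(lead))       # 0xd834 2 -> two more bytes follow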
In UTF-8, the bit pattern of the first byte specifies how many bytes there are for the character. If the most significant bit is 0, there is just this one byte (so ASCII characters are represented just as in ASCII). If the first three bits are 110, there is one more byte. If the first four bits are 1110, two more bytes, and if the first five bits are 11110, three more bytes.
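Here is a minimal Python sketch of that lead-byte test (the helper utf8_seq_len is hypothetical, just to show the bit patterns in action):

    # Sequence length from the lead byte's high bits.
    def utf8_seq_len(lead):
        if lead >> 7 == 0b0:          # 0xxxxxxx -> 1 byte (ASCII)
            return 1
        if lead >> 5 == 0b110:        # 110xxxxx -> 2 bytes
            return 2
        if lead >> 4 == 0b1110:       # 1110xxxx -> 3 bytes
            return 3
        if lead >> 3 == 0b11110:      # 11110xxx -> 4 bytes
            return 4
        raise ValueError("continuation byte, not a start byte")

    for ch in ("A", "é", "€", "😀"):
        encoded = ch.encode("utf-8")
        print(ch, utf8_seq_len(encoded[0]) == len(encoded))   # True for all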
If you pick up an arbitrary byte from a UTF-8 stream, you cannot generally decide whether it is part of a 2-, 3-, or 4-byte representation. If it matches one of the start-byte patterns described above, you know what it is. But if it starts with the bits 10, it is a continuation byte, and you cannot tell the length of the sequence it belongs to.
This means that a UTF-8 stream must be processed sequentially. Direct addressing by character position is impossible; to find the Nth character, you need to start reading from the beginning and observe the bit patterns of start bytes.
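A sketch of that sequential scan in Python (nth_char_offset is a made-up helper for illustration): it counts start bytes, i.e. any byte that is not a continuation byte of the form 10xxxxxx.

    # Byte offset of the Nth character in a UTF-8 buffer.
    def nth_char_offset(buf, n):
        seen = -1
        for i, byte in enumerate(buf):
            if byte & 0xC0 != 0x80:   # start byte: a new character begins
                seen += 1
                if seen == n:
                    return i
        raise IndexError("fewer than n+1 characters")

    buf = "aé€😀z".encode("utf-8")     # 1+2+3+4 bytes, then 'z'
    print(nth_char_offset(buf, 4))    # 10 -> 'z' starts at byte 10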
Upvotes: 1
Reputation: 231851
First, be aware that there are major differences between Unicode and a particular encoding. There are multiple ways to encode Unicode (UTF-8, UTF-16, and UTF-32 being three of the more common), each of which has different properties. You appear to be describing the properties of the UTF-8 encoding.
Yes, the leading bit(s) within each byte of a UTF-8 encoded string indicate how many bytes a particular character uses. The Wikipedia article on the UTF-8 encoding shows the various bit-patterns for each byte for 1, 2, 3, and 4 byte characters.
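You can see those bit patterns directly by printing each byte of an encoded character in binary (a Python sketch, just for illustration):

    # Each byte of the 3-byte character U+20AC (€) in binary:
    for byte in "€".encode("utf-8"):
        print(f"{byte:08b}")
    # 11100010   lead byte 1110xxxx -> 3-byte character
    # 10000010   continuation byte 10xxxxxx
    # 10101100   continuation byte 10xxxxxx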
Upvotes: 3