Aan

Reputation: 12890

Is it a myth that a Unicode letter is two bytes?

I have read an article that talks about text encoding. It says that the claim that a Unicode letter is two bytes is a myth. It explains why, but my English is not good enough to understand the reasons.

Could anyone here explain whether that is true, and why? Please keep the English as simple as possible.

Upvotes: 1

Views: 364

Answers (2)

Some programmer dude

Reputation: 409176

Windows, and many legacy applications, have traditionally used 16 bits (two bytes) to represent Unicode characters, but the actual standard covers 21 bits of code points (0x000000 to 0x10FFFF). That's why there are so many different encodings (UTF-8 and so on). Today the most common internal representation of Unicode characters inside programs is probably UTF-32 (32 bits, 4 bytes), while most text is stored on disk in UTF-8 format.
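
For example, here is a small Python sketch (the question does not mention a language, so Python is used purely for illustration) showing that a character above U+FFFF simply does not fit in two bytes:

    # A character whose code point is above U+FFFF needs more than 16 bits.
    ch = "\U0001F600"                    # U+1F600 GRINNING FACE
    print(hex(ord(ch)))                  # 0x1f600 -> does not fit in a 16-bit value
    print(len(ch.encode("utf-16-le")))   # 4 bytes: two 16-bit units (a surrogate pair)
    print(len(ch.encode("utf-8")))       # 4 bytes in UTF-8
    print(len(ch.encode("utf-32-le")))   # 4 bytes in UTF-32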

For more information about the different Unicode encoding schemes, see this Wikipedia article: http://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings

Upvotes: 3

stefan

Reputation: 2886

A character can need more or fewer bytes than two, depending on the Unicode encoding form and which character you want to represent, but at most 4 bytes per character:

Character encoding standards define not only the identity of each character and its numeric value, or code point, but also how this value is represented in bits.
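
In Python (again, just as an illustration), you can see the difference between a character's code point and the bits that represent it:

    # One character, one code point, but a different byte sequence per encoding form.
    ch = "€"                              # EURO SIGN
    print(hex(ord(ch)))                   # 0x20ac   -> the code point (numeric value)
    print(ch.encode("utf-8").hex())       # e282ac   -> 3 bytes in UTF-8
    print(ch.encode("utf-16-le").hex())   # ac20     -> 2 bytes in UTF-16
    print(ch.encode("utf-32-le").hex())   # ac200000 -> 4 bytes in UTF-32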

The Unicode Standard defines three encoding forms that allow the same data to be transmitted in a byte, word or double word oriented format (i.e. in 8, 16 or 32 bits per code unit). All three encoding forms encode the same common character repertoire and can be efficiently transformed into one another without loss of data. The Unicode Consortium fully endorses the use of any of these encoding forms as a conformant way of implementing the Unicode Standard.
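
A short Python sketch (illustrative only) of that lossless round trip between encoding forms:

    # Encoding the same text in two forms and decoding it back loses nothing.
    text = "héllo \U0001F600"
    as_utf8  = text.encode("utf-8")
    as_utf16 = text.encode("utf-16-le")
    assert as_utf8.decode("utf-8") == as_utf16.decode("utf-16-le") == text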

UTF-8 is popular for HTML and similar protocols. UTF-8 is a way of transforming all Unicode characters into a variable length encoding of bytes. It has the advantages that the Unicode characters corresponding to the familiar ASCII set have the same byte values as ASCII, and that Unicode characters transformed into UTF-8 can be used with much existing software without extensive software rewrites.
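
For instance, in Python (illustrative only), ASCII characters keep their single-byte values while other characters take 2 to 4 bytes in UTF-8:

    # UTF-8 is variable length: 1 byte for ASCII, up to 4 bytes for other characters.
    for ch in ("A", "ñ", "€", "\U0001F600"):
        print(hex(ord(ch)), len(ch.encode("utf-8")), "byte(s) in UTF-8")
    # 0x41     1 byte(s)  (same byte value as plain ASCII)
    # 0xf1     2 byte(s)
    # 0x20ac   3 byte(s)
    # 0x1f600  4 byte(s)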

UTF-16 is popular in many environments that need to balance efficient access to characters with economical use of storage. It is reasonably compact and all the heavily used characters fit into a single 16-bit code unit, while all other characters are accessible via pairs of 16-bit code units.
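
A Python sketch (illustrative only) of those 16-bit code units:

    # Characters in the Basic Multilingual Plane take one 16-bit unit,
    # all others take a pair of 16-bit units (a surrogate pair).
    for ch in ("A", "€", "\U0001F600"):
        units = len(ch.encode("utf-16-le")) // 2
        print(hex(ord(ch)), units, "16-bit code unit(s)")
    # 0x41     1 16-bit code unit(s)
    # 0x20ac   1 16-bit code unit(s)
    # 0x1f600  2 16-bit code unit(s)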

UTF-32 is useful where memory space is no concern, but fixed width, single code unit access to characters is desired. Each Unicode character is encoded in a single 32-bit code unit when using UTF-32.
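
And the corresponding Python sketch (illustrative only) for UTF-32:

    # UTF-32 is fixed width: every character occupies exactly 4 bytes.
    for ch in ("A", "€", "\U0001F600"):
        print(hex(ord(ch)), len(ch.encode("utf-32-le")), "bytes in UTF-32")
    # 0x41     4 bytes
    # 0x20ac   4 bytes
    # 0x1f600  4 bytes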

All three encoding forms need at most 4 bytes (or 32 bits) of data for each character.

See http://www.unicode.org/standard/principles.html

Upvotes: 3
