Reputation: 7289
I read this great tutorial:
http://www.joelonsoftware.com/articles/Unicode.html
But I didn't understand how UTF-8 deals with the big-endian vs. little-endian machine issue. For a single byte it's fine, but how does it work for multi-byte characters?
Can someone explain it better?
Upvotes: 1
Views: 787
Reputation: 12708
UTF-8 has no endianness: it is defined as a sequence of individual bytes, and each byte is processed in order, so there is no multi-byte unit whose bytes could be swapped. For the same reason, a BOM is not needed to determine byte order in UTF-8; if one is present, it is always transmitted as the same sequence of bytes (EF BB BF).
Upvotes: 0
Reputation: 19114
There is no endianness problem with UTF-8. The problem arises with UTF-16, because a sequence of two-byte code units has to be written out as a sequence of single bytes when it goes into a file or a communication stream, and machines may disagree about the byte order within a two-byte number. Because UTF-8 works at the byte level, no BOM is needed to parse the sequence correctly on both big-endian and little-endian machines. It does not matter whether a character is multi-byte: UTF-8 defines exactly what order the bytes must come in when a code point is encoded as multiple bytes.
The BOM in UTF-8 serves a completely different purpose (so the name 'Byte Order Mark' is a little off). It signals that "this is going to be a UTF-8 stream". The UTF-8 BOM is generally unpopular, and many programs do not handle it correctly. The site utf8everywhere.org argues that it should be deprecated in the future.
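To make the "signature, not byte order" point concrete, here is a small sketch in Python (my choice of language, not something from the question). It uses the standard `utf-8-sig` codec, which writes and strips the BOM; the variable names are only illustrative.

```python
import codecs

text = "\u3042"  # あ

plain = text.encode("utf-8")               # no BOM
with_signature = text.encode("utf-8-sig")  # BOM + the same bytes

# The UTF-8 BOM is one fixed byte sequence; there is no "other order" of it.
assert with_signature == codecs.BOM_UTF8 + plain

# A decoder that does not expect the BOM keeps it as a U+FEFF character,
# which is the kind of mishandling that makes the UTF-8 BOM unpopular.
decoded_naive = with_signature.decode("utf-8")
decoded_sig = with_signature.decode("utf-8-sig")  # BOM stripped

print(with_signature.hex(" "))
print(repr(decoded_naive), repr(decoded_sig))
```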
Upvotes: 1
Reputation: 1141
Here is a link that explains UTF-8 in depth. http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
At the heart of it, UTF-16 is oriented around 16-bit (short) integers and UTF-8 is oriented around bytes. Since architectures can differ in how the bytes of a multi-byte datatype are ordered (big-endian or little-endian), the UTF-16 encoding can go either way. On all architectures I am aware of, there is no endianness below the byte level (at the nibble or semi-octet level): a byte is always handled as a single 8-bit unit. Therefore UTF-8 has no endianness.
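As a rough illustration (Python, using U+20AC, the euro sign, as an arbitrary sample of my own), packing one 16-bit integer in each byte order gives exactly the two UTF-16 serializations, while the UTF-8 bytes come out the same regardless of the machine:

```python
import struct
import sys

code_unit = 0x20AC  # U+20AC EURO SIGN, a single 16-bit code unit in UTF-16

# The same 16-bit integer written out in the two possible byte orders.
as_little = struct.pack("<H", code_unit)
as_big = struct.pack(">H", code_unit)

# These match the two explicit UTF-16 flavours.
assert "\u20ac".encode("utf-16-le") == as_little
assert "\u20ac".encode("utf-16-be") == as_big

# UTF-8 is specified byte by byte, so there is only one possible result.
utf8 = "\u20ac".encode("utf-8")

print("this machine is", sys.byteorder, "endian")
print("UTF-16LE:", as_little.hex(" "))
print("UTF-16BE:", as_big.hex(" "))
print("UTF-8:   ", utf8.hex(" "))
```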
The Japanese character あ is a good example. It is U+3042 (binary=0011 0000 : 0100 0010).
Here is some information on unicode あ
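To show how the bits of U+3042 end up in a fixed byte order, here is a minimal hand-rolled encoder for the three-byte UTF-8 form only. The function name `encode_three_byte` is made up for this sketch; real code should just call `str.encode("utf-8")`.

```python
def encode_three_byte(code_point: int) -> bytes:
    """Encode a code point in U+0800..U+FFFF as 1110xxxx 10xxxxxx 10xxxxxx."""
    if not 0x0800 <= code_point <= 0xFFFF:
        raise ValueError("only the three-byte range is handled in this sketch")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogates are not valid in UTF-8")
    lead = 0b11100000 | (code_point >> 12)                 # top 4 bits
    middle = 0b10000000 | ((code_point >> 6) & 0b111111)   # next 6 bits
    last = 0b10000000 | (code_point & 0b111111)            # low 6 bits
    # The lead byte always comes first, then the continuation bytes in order.
    return bytes([lead, middle, last])


encoded = encode_three_byte(0x3042)
assert encoded == "\u3042".encode("utf-8")
print(encoded.hex(" "))
print(" ".join(f"{b:08b}" for b in encoded))
```

The lead byte carries the high bits and the continuation bytes follow in a fixed sequence, so a reader on any architecture consumes the bytes one at a time in the same order.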
Upvotes: 5