theJava

Reputation: 15034

Difference between UTF-8 and UTF-16?

Difference between UTF-8 and UTF-16? Why do we need these?

MessageDigest md = MessageDigest.getInstance("SHA-256");
String text = "This is some text";

md.update(text.getBytes("UTF-8")); // Change this to "UTF-16" if needed
byte[] digest = md.digest();

Upvotes: 155

Views: 147213

Answers (6)

plugwash

Reputation: 10504

Difference between UTF-8 and UTF-16?

UTF-8 is a sequence of 8 bit bytes, while UTF-16 is a sequence of 16 bit units (hereafter referred to as words).

In UTF-8, code points with values 0 to 0x7F are encoded directly as single bytes, code points with values 0x80 to 0x7FF as two bytes, code points with values 0x800 to 0xFFFF as three bytes, and code points with values 0x10000 to 0x10FFFF as four bytes.

In UTF-16, code points 0x0000 to 0xFFFF (note: values 0xD800 to 0xDFFF are not valid Unicode code points) are encoded directly as single words. Code points with values 0x10000 to 0x10FFFF are encoded as two words. These two-word sequences are known as surrogate pairs.
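
A minimal Java sketch illustrating these lengths (the sample code points are arbitrary picks, one from each range):

import java.nio.charset.StandardCharsets;

public class EncodedLengths {
    public static void main(String[] args) {
        // One code point from each range: U+0041 'A', U+00E9 'é', U+20AC '€', U+1F600 emoji
        int[] codePoints = {0x41, 0xE9, 0x20AC, 0x1F600};
        for (int cp : codePoints) {
            String s = new String(Character.toChars(cp));
            int utf8Bytes = s.getBytes(StandardCharsets.UTF_8).length;
            int utf16Words = s.length(); // a Java String is a sequence of 16-bit UTF-16 units
            System.out.printf("U+%04X -> %d UTF-8 byte(s), %d UTF-16 word(s)%n",
                    cp, utf8Bytes, utf16Words);
        }
    }
}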

Why do we need these?

Because history is messy. Different companies and organisations have different priorities and ideas, and once a format decision is made, it tends to stick around.

Back in 1989 the ISO had proposed a Universal character set as a draft of ISO 10646, but the major software vendors did not like it, seeing it as over-complicated. They devised their own system called Unicode, a fixed-width 16-bit encoding. The software companies convinced a sufficient number of national standards bodies to vote down the draft of ISO 10646 and ISO was pushed into unification with Unicode.

This original 16-bit Unicode was adopted as the native internal format by a number of major software products. Two of the most notable were Java (released in 1996) and Windows NT (released in 1993). A string in Java or NT is, at its most fundamental, a sequence of 16-bit values.

There was a need to encode Unicode in byte-orientated "extended ASCII" environments. The ISO had proposed a standard, "UTF-1", for this, but people didn't like it: it was slow to implement because it involved modulo operators, and the encoded data had some undesirable properties.

X/Open circulated a proposal for a new standard for encoding Unicode/UCS values in extended ASCII environments. This was altered slightly by the Plan 9 developers to become what we now know as UTF-8.

Eventually, the software vendors had to concede that 16 bits was not enough. In particular, China was pressing heavily for support for historic Chinese characters that were too numerous to encode in 16 bits.

The end result was Unicode 2.0, which expanded the code space to just over 20 bits and introduced UTF-16. At the same time, Unicode 2.0 also elevated UTF-8 to be a formal part of the standard. Finally it introduced UTF-32, a new fixed width encoding.

In practice, due to compatibility and efficiency considerations, relatively few systems adopted UTF-32. Those systems that had adopted the original 16-bit Unicode (e.g. Windows, Java) moved to UTF-16, while those that had remained byte-orientated (e.g. Unix, the Internet) continued their gradual move from legacy 8-bit encodings to UTF-8.

Upvotes: 0

Venkateswara Rao

Reputation: 5392

A simple way to differentiate UTF-8 and UTF-16 is to identify what they have in common.

Other than sharing the same Unicode code point for a given character, each is its own format.

UTF-8 tries to represent each Unicode code point in one byte (if it is ASCII), otherwise in two, three, or four bytes.

UTF-16 starts by representing each code point in two bytes. If two bytes are not sufficient, it uses four bytes (a surrogate pair); no code point needs more than that.

Theoretically, UTF-16 is more space efficient, but in practice UTF-8 usually is, because most of the characters being processed (around 98% of typical data) are ASCII, which UTF-8 represents in a single byte and UTF-16 in two bytes, as the sketch below shows.
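
A quick sketch of that point in Java (the sample string is arbitrary):

import java.nio.charset.StandardCharsets;

public class AsciiSizes {
    public static void main(String[] args) {
        String ascii = "Hello, world!"; // 13 ASCII characters
        // UTF-8 stores each ASCII character in one byte, UTF-16 in two.
        System.out.println(ascii.getBytes(StandardCharsets.UTF_8).length);    // 13
        System.out.println(ascii.getBytes(StandardCharsets.UTF_16LE).length); // 26
    }
}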

Also, UTF-8 is a superset of ASCII encoding, so any data that is valid ASCII is also accepted by a UTF-8 processor. This is not true for UTF-16: a UTF-16 decoder cannot consume plain ASCII, and this has been a big hurdle for UTF-16 adoption.

Another point to note is that, as of now, every Unicode code point fits in at most 4 bytes of UTF-8 (considering all languages of the world). This is the same maximum as UTF-16, so UTF-16 gives no real saving in space compared to UTF-8 ( https://stackoverflow.com/a/8505038/3343801 ).

So, people use UTF-8 wherever possible.

Upvotes: -3

Basil Bourque

Reputation: 338564

Security: Use only UTF-8

Difference between UTF-8 and UTF-16? Why do we need these?

There have been at least a couple of security vulnerabilities in implementations of UTF-16. See Wikipedia for details.

WHATWG and W3C have now declared that only UTF-8 is to be used on the Web.

The [security] problems outlined here go away when exclusively using UTF-8, which is one of the many reasons that is now the mandatory encoding for all things.

Other groups are saying the same.

So while UTF-16 may continue being used internally by some systems such as Java and Windows, what little use of UTF-16 you may have seen in the past for data files, data exchange, and such, will likely fade away entirely.

Upvotes: 11

Sergei Tachenov

Reputation: 24869

I believe there are a lot of good articles about this around the Web, but here is a short summary.

Both UTF-8 and UTF-16 are variable-length encodings. However, in UTF-8 a character may occupy a minimum of 8 bits, while in UTF-16 a character occupies a minimum of 16 bits.

Main UTF-8 pros:

  • Basic ASCII characters like digits, Latin characters with no accents, etc. occupy one byte which is identical to US-ASCII representation. This way all US-ASCII strings become valid UTF-8, which provides decent backwards compatibility in many cases.
  • No null bytes, which allows the use of null-terminated strings; this introduces a great deal of backwards compatibility too.
  • UTF-8 is independent of byte order, so you don't have to worry about Big Endian / Little Endian issue.

Main UTF-8 cons:

  • Many common characters have different lengths, which makes indexing by codepoint and calculating a codepoint count terribly slow.
  • Even though byte order doesn't matter, UTF-8 text sometimes still carries a BOM (byte order mark), which serves to signal that the text is encoded in UTF-8 but also breaks compatibility with ASCII software even if the text contains only ASCII characters (a small demonstration follows below the list). Microsoft software (like Notepad) especially likes to add a BOM to UTF-8.
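
A small sketch of that BOM behaviour in Java (the bytes are built by hand just for illustration; the UTF-8 BOM is the byte sequence EF BB BF and decodes to U+FEFF):

import java.nio.charset.StandardCharsets;

public class BomDemo {
    public static void main(String[] args) {
        // The UTF-8 BOM is the three bytes EF BB BF; Java's decoder does not strip it.
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, (byte) 'H', (byte) 'i'};
        String s = new String(withBom, StandardCharsets.UTF_8);
        System.out.println(s.length());              // 3, not 2: the BOM decodes to U+FEFF
        System.out.println(s.charAt(0) == '\uFEFF'); // true
    }
}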

Main UTF-16 pros:

  • BMP (basic multilingual plane) characters, including Latin, Cyrillic, most Chinese (the PRC made support for some codepoints outside the BMP mandatory), and most Japanese, can be represented with 2 bytes. This speeds up indexing and calculating the codepoint count in case the text does not contain supplementary characters.
  • Even if the text has supplementary characters, they are still represented by pairs of 16-bit values, which means that the total length is still divisible by two and allows the use of a 16-bit char as the primitive component of the string.

Main UTF-16 cons:

  • Lots of null bytes in US-ASCII strings, which means no null-terminated strings and a lot of wasted memory.
  • Using it as a fixed-length encoding “mostly works” in many common scenarios (especially in US / EU / countries with Cyrillic alphabets / Israel / Arab countries / Iran and many others), often leading to broken support where it doesn't. This means the programmers have to be aware of surrogate pairs and handle them properly in cases where it matters!
  • It's variable-length, so counting or indexing codepoints is costly, though less so than with UTF-8.

In general, UTF-16 is usually better for in-memory representation because BE/LE is irrelevant there (just use native order) and indexing is faster (just don't forget to handle surrogate pairs properly). UTF-8, on the other hand, is extremely good for text files and network protocols because there is no BE/LE issue and null-termination often comes in handy, as well as ASCII-compatibility.
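
A short Java sketch of what handling surrogate pairs properly looks like (the emoji is just an arbitrary character outside the BMP):

public class SurrogateDemo {
    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b"; // "a😀b" - U+1F600 needs a surrogate pair in UTF-16
        System.out.println(s.length());                       // 4 UTF-16 code units
        System.out.println(s.codePointCount(0, s.length()));  // 3 actual code points
        // Iterate by code point instead of by 16-bit char:
        s.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
    }
}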

Upvotes: 317

Jon Skeet

Reputation: 1500515

They're simply different schemes for representing Unicode characters.

Both are variable-length - UTF-16 uses 2 bytes for all characters in the basic multilingual plane (BMP) which contains most characters in common use.

UTF-8 uses between 1 and 3 bytes for characters in the BMP, and up to 4 for characters in the current Unicode range of U+0000 to U+10FFFF; the original design is extensible up to U+7FFFFFFF if that ever becomes necessary... but notably all ASCII characters are represented in a single byte each.

For the purposes of a message digest it won't matter which of these you pick, so long as everyone who tries to recreate the digest uses the same option.
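
A sketch based on the question's code: hashing the UTF-8 bytes and the UTF-16 bytes of the same string gives two different digests, each reproducible as long as the same charset is always used (the Base64 output is just for display):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Base64;

public class DigestCharsets {
    public static void main(String[] args) throws Exception {
        String text = "This is some text";
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        byte[] utf8Digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
        byte[] utf16Digest = md.digest(text.getBytes(StandardCharsets.UTF_16)); // md resets after digest()
        // Different byte sequences in, different digests out:
        System.out.println(Base64.getEncoder().encodeToString(utf8Digest));
        System.out.println(Base64.getEncoder().encodeToString(utf16Digest));
    }
}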

See this page for more about UTF-8 and Unicode.

(Note that Java char values are UTF-16 code units, which cover the BMP; to represent characters above U+FFFF you need to use surrogate pairs in Java.)

Upvotes: 20

bestsss

Reputation: 12056

This is somewhat unrelated to UTF-8/16 in general (although the code below does convert to UTF-16, and the BE/LE part can be set with a single line), but below is the fastest way to convert a String to byte[]. It is good exactly for the case provided (hash code), since String.getBytes(enc) is relatively slow.

import java.nio.ByteBuffer;

static byte[] toBytes(String s) {
    // Copies the String's UTF-16 code units directly into a byte array
    // (big-endian by default, no BOM); avoids the overhead of String.getBytes(enc).
    byte[] b = new byte[s.length() * 2];
    ByteBuffer.wrap(b).asCharBuffer().put(s);
    return b;
}
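
For the BE/LE part, a variant along these lines would set the byte order with one extra call (toBytesLE is just an illustrative name; ByteBuffer defaults to big-endian):

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

static byte[] toBytesLE(String s) {
    byte[] b = new byte[s.length() * 2];
    // Switch the buffer to little-endian before writing the chars.
    ByteBuffer.wrap(b).order(ByteOrder.LITTLE_ENDIAN).asCharBuffer().put(s);
    return b;
}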

Upvotes: 4
