Unicode Encodings

Question

I have a question as to how programs parse strings if they do not a priori know the encoding that is used.

As I understand it, the UTF-8 encoding stores ASCII characters with 1 byte, and all other characters with up to 4 bytes (the original design allowed up to 6). Thus, for example, two spaces would be stored in memory as 0x2020.

How, then, would a program be able to tell the difference between this string and the string 0x2020 encoded using UTF-16, which corresponds to the single character † (U+2020)? That character evidently resembles the symbol sometimes used to denote the adjoint of an operator in mathematics (I just looked that up).

It seems as if the parser would always have to know the encoding of a string beforehand. If so, how is this implemented in practice? Is there a byte preceding each string which tells the parser what encoding is used or something?

dan04 · Accepted Answer

Does the language always store strings in a certain encoding so that the display function could safely assume that the string was encoded, say, using UTF-8?

It depends on the language.

In C#, yes. A char is defined by the language specification (8.2.1) as a UTF-16 code unit, and thus a string is always UTF-16. Just like Java.

In Ruby 1.9, a string is a byte array with an associated Encoding.

But in pre-Unicode languages like C (and badly-designed post-Unicode languages like PHP), a string is just a byte array with no encoding information. You have to rely on convention. It's a real interesting experience to write a program that uses both a library that assumes UTF-8 strings and another that assumes windows-1252 strings.
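To make the ambiguity concrete, here is a minimal Python sketch (Python is my choice for illustration, it is not from the answer). It decodes the asker's two bytes under two assumed encodings, and also shows what happens when UTF-8 bytes are handed to code that assumes windows-1252. The variable names are made up for the example.

```python
# The two bytes from the question: 0x20 0x20.
raw = bytes([0x20, 0x20])

# Same bytes, two different assumptions about the encoding.
as_utf8 = raw.decode("utf-8")       # two space characters
as_utf16 = raw.decode("utf-16-be")  # one character, U+2020 DAGGER

# UTF-8 bytes misread by code that assumes windows-1252:
# "é" is C3 A9 in UTF-8, and those two bytes are "Ã" and "©" in windows-1252.
mixed = "é".encode("utf-8").decode("windows-1252")

print(repr(as_utf8), repr(as_utf16), repr(mixed))
```

Nothing in the bytes themselves says which reading is right; the decoder simply does what it is told.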

A question that's equally relevant to all languages is: How do you determine the encoding of a byte array that contains encoded text? There are several different approaches:

Encoding declarations.

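In protocols that use MIME types (notably, SMTP and HTTP), you can declare Content-Type: text/html; charset=UTF-8. In HTML, you can use <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> or the newer <meta charset="UTF-8">. In XML, there's <?xml version="1.0" encoding="UTF-8"?>. In Python source code, there's # -*- coding: UTF-8 -*-.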

Unfortunately, such declarations aren't always accurate. And they aren't available at all for locally-stored plain .txt files, so then a different approach must be used.
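As a rough illustration of reading such declarations, the sketch below uses only the Python standard library: email.message.Message to pull the charset parameter out of a Content-Type value, and tokenize.detect_encoding to read a Python coding comment. The two wrapper function names are hypothetical, and this covers just two of the declaration styles listed above.

```python
import io
import tokenize
from email.message import Message


def charset_from_content_type(header_value, default=None):
    """Return the charset parameter of a Content-Type value, lowercased.

    Returns `default` when the header carries no charset parameter.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset(default)


def charset_from_python_source(source_bytes):
    """Return the encoding named by a coding comment (or BOM) in Python source.

    detect_encoding falls back to 'utf-8' when nothing is declared.
    """
    readline = io.BytesIO(source_bytes).readline
    encoding, _lines_read = tokenize.detect_encoding(readline)
    return encoding


print(charset_from_content_type("text/html; charset=UTF-8"))
print(charset_from_python_source(b"# -*- coding: UTF-8 -*-\nx = 1\n"))
```

Both helpers only report what the data claims about itself, so the caveat about inaccurate declarations still applies.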

Byte-order mark (BOM)

Putting the special character U+FEFF at the beginning of a file lets you distinguish between the various UTF encodings.

But it's not usable for legacy encodings like ISO-8859-x or Windows-125x, and not always used with UTF-8.
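A BOM check can be sketched in a few lines of Python using the constants in the standard codecs module. The function name sniff_bom and the choice to return a Python codec name are assumptions made for this example. One detail worth noting: the UTF-32 little-endian BOM (FF FE 00 00) starts with the UTF-16 little-endian BOM (FF FE), so the longer signatures have to be tested first.

```python
import codecs

# Longest signatures first: the UTF-32 LE BOM begins with the UTF-16 LE BOM,
# so testing UTF-16 first would misreport UTF-32 LE data.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_BE, "utf-16"),
    (codecs.BOM_UTF16_LE, "utf-16"),
]


def sniff_bom(data):
    """Return a codec name that consumes the BOM, or None if there is no BOM."""
    for bom, codec in _BOMS:
        if data.startswith(bom):
            return codec
    return None


sample = "hi".encode("utf-16")   # Python's "utf-16" codec writes a BOM
codec = sniff_bom(sample)
print(codec, sample.decode(codec))
```

The returned names ("utf-8-sig", "utf-16", "utf-32") are the Python codecs that strip the BOM while decoding. UTF-16 little-endian text that starts with a BOM followed by U+0000 would be misread as UTF-32 here, which is rare but possible.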

Validation

Some encodings have strict rules about what makes a valid string. The best-known is UTF-8, with its rigid separation of leading/trailing bytes, prohibition of "overlong" encodings, etc. UTF-32 is even easier to recognize because the restriction of Unicode to 17 "planes" means that every code unit must have the form 00 {00-10} xx xx (or xx xx {00-10} 00 for little-endian).

So if text validates as being UTF-8 or UTF-32, you can safely assume that it is. There's a possibility of false positives, but it's very low.

However, this approach doesn't work well for UTF-16, where the false-positive rate is too high. (The only way for an even-length byte array to not be valid UTF-16 is to contain unpaired surrogates, or U+FFFE or U+FFFF.)
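The difference in strictness can be seen by using Python's own strict decoders as validators. This is only a sketch: is_valid is a hypothetical helper, and Python's decoders are not guaranteed to apply exactly the same rules as another library (for instance, Python's UTF-16 decoder rejects unpaired surrogates but accepts U+FFFF).

```python
def is_valid(data, encoding):
    """Return True if `data` decodes without error under `encoding`."""
    try:
        data.decode(encoding, errors="strict")
        return True
    except UnicodeDecodeError:
        return False


latin1_bytes = "héllo!".encode("latin-1")   # 6 bytes, contains 0xE9

# UTF-8: 0xE9 is a lead byte and must be followed by continuation bytes.
print(is_valid(latin1_bytes, "utf-8"))            # rejected
# UTF-8: C0 A0 is an "overlong" encoding of U+0020.
print(is_valid(b"\xc0\xa0", "utf-8"))             # rejected
# UTF-32: 00 11 00 00 would be U+110000, beyond the 17 planes.
print(is_valid(b"\x00\x11\x00\x00", "utf-32-be")) # rejected
# UTF-16: the same Latin-1 bytes pair up into three ordinary code units.
print(is_valid(latin1_bytes, "utf-16-be"))        # accepted (false positive)
# UTF-16: a high surrogate (D800) not followed by a low surrogate.
print(is_valid(b"\xd8\x00\x00\x20", "utf-16-be")) # rejected
```

The fourth call is the false-positive problem in miniature: text that is not UTF-16 at all still passes, because almost any even-length byte sequence does.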

Statistical analysis

Use character frequency tables of various language/encoding combinations. This is the approach used by chardet (in combination with BOM and validation).
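To give a feel for the idea, here is a deliberately tiny Python sketch. It is not chardet's algorithm: the "frequency table" is just a hypothetical set of roughly the ten most common lowercase Russian letters, there are only two candidate encodings, and the function names are invented for the example. Each candidate decoding is scored by how much of the result falls in that set.

```python
# Hypothetical mini frequency table: roughly the ten most frequent
# lowercase letters in Russian text.
COMMON_RU = set("оеаинтсрвл")


def score(text):
    """Fraction of characters in `text` that are common Russian letters."""
    if not text:
        return 0.0
    return sum(ch in COMMON_RU for ch in text) / len(text)


def guess_cyrillic_encoding(data, candidates=("koi8-r", "windows-1251")):
    """Return the candidate whose decoding looks most like Russian text."""
    return max(
        candidates,
        key=lambda enc: score(data.decode(enc, errors="replace")),
    )


message = "привет мир"
print(guess_cyrillic_encoding(message.encode("koi8-r")))
print(guess_cyrillic_encoding(message.encode("windows-1251")))
```

Both byte strings decode without error under either encoding, so validation alone cannot separate them; the wrong decoding simply produces unlikely letters. A real detector uses much larger tables and many more language/encoding pairs.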

Falling back on a default encoding

When all else fails, assume ISO-8859-1, windows-1252, or the platform default (Encoding.Default in .NET).
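A fallback can be sketched as "try the strict encoding first, then give up and use the default". The helper below is hypothetical and assumes windows-1252 as the default; it uses errors="replace" because a few byte values (such as 0x81) are unassigned in Python's windows-1252 codec and would otherwise raise.

```python
def decode_with_fallback(data, default="windows-1252"):
    """Decode as UTF-8 if the bytes validate, otherwise use `default`."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Unassigned bytes in the default encoding become U+FFFD
        # instead of raising an exception.
        return data.decode(default, errors="replace")


print(decode_with_fallback("naïve".encode("utf-8")))
print(decode_with_fallback("naïve".encode("windows-1252")))
```

Passing "iso-8859-1" as the default never fails, since every byte value maps to a character in that encoding, but it can silently produce the wrong text.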