Reputation: 4727
I have a question as to how programs parse strings if they do not a priori know the encoding that is used.
As I understand it, the UTF-8 encoding stores ASII characters with 1 byte, and all other chracters with up to as many as 6 (I think it's 6) bytes. Thus, for example, two spaces would be stored in memory as 0x2020.
How then, would a program be able to determine the difference between this string and the string`0x2020 encoded using the UTF-16 encoding which corresponds to the single character which evidently is a character that appears similar to the symbol sometimes used to denote the adjoint of an operator in mathematics (I just looked that up here).
It seems as if the parser would always have to know the encoding of a string before hand. If so, how is this implemented in practice? Is there a byte preceeding each string which tells the parser what encoding is used or something?
Upvotes: 3
Views: 395
Reputation: 90995
Does the language always store strings in a certain encoding so that the display function could safely assume that the string was encoded, say, using UTF-8?
In depends on the language.
In C#, yes. A char
is defined by the language specification (8.2.1) as a UTF-16 code unit, and thus a string
is always UTF-16. Just like Java.
In Ruby 1.9, a string is a byte array with an associated Encoding
.
But in pre-Unicode languages like C (and badly-designed post-Unicode languages like PHP), a string is just a byte array with no encoding information. You have to rely on convention. It's a real interesting experience to write a program that uses both a library that assumes UTF-8 strings and another that assumes windows-1252 strings.
A question that's equally relevant to all languages is: How do you determine the encoding of a byte array that contains encoded text? There are several different approaches:
In protocols that use MIME types (notably, SMTP and HTTP), you can declare Content-Type: text/html; charset=UTF-8
. In HTML, you can use <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
or the newer <meta charset="UTF-8">
. In XML, there's <?xml version="1.0" encoding="UTF-8"?>
. In Python source code, there's # -*- coding: UTF-8 -*-
.
Unfortunately, such declarations aren't always accurate. And they aren't available at all for locally-stored plain .txt
files, so then a different approach must be used.
Putting the special character U+FEFF at the beginning of a file lets you distinguish between the various UTF encodings.
But it's not usable for legacy encodings like ISO-8859-x or Windows-125x, and not always used with UTF-8.
Some encodings have strict rules about what makes a valid string. The best-known is UTF-8, with its rigid separation of leading/trailing bytes, prohibition of "overlong" encodings, etc. UTF-32 is even easier to recognize because the restriction of Unicode to 17 "planes" means that every code unit must have the form 00 {00-10} xx xx (or xx xx {00-10} 00 for little-endian).
So if text validates as being UTF-8 or UTF-32, you can safely assume that it is. There's a possibility of false positives, but it's very low.
However, this approach doesn't work well for UTF-16, where the false-positive rate is too high. (The only way for an even-length byte array to not be valid UTF-16 is to contain unpaired surrogates, or U+FFFE or U+FFFF.)
Use character frequency tables of various language/encoding combinations. This is the approach used by chardet (in combination with BOM and validation).
When all else fails, assume ISO-8859-1, windows-1252, or Encoding.Default
.
Upvotes: 4
Reputation: 52159
In general, it is not possible to know for certain the exact encoding used based solely on the stream of bytes that can represent text. However, if there is a byte order mark somewhere, you can use it at least as a hint as to what encoding is being used.
But with no hints or some kind of contract/exchange of metadata between the producer and consumer of the text, you can't be 100% sure. You can try using a heuristic, but then you get these kinds of problems if you end up guessing wrong.
If you want to be really sure, set up some kind of protocol or contract between the producer and the consumer of the text so that the text and the encoding scheme is known. You can hardcode the encoding scheme (for example, your program may parse UTF-8 and only UTF-8), or ensure the producer of the text always prepend a byte order mark or specially designed header bytes to communicate the encoding scheme.
Upvotes: 7