Michael

Reputation: 1607

Is utf-8 a character set or an encoding?

From what I understand, Unicode is a character set containing all possible characters in all languages, and UTF-8 is a way to represent each of those characters in memory. If that's the case, why do we put:

<meta charset="utf-8">

and not

<meta encoding="utf-8">

in an HTML document to indicate a UTF-8 encoding?

Upvotes: 2

Views: 1374

Answers (4)

Norman Gray

Reputation: 12514

UTF-8 is an encoding of Unicode; it's not really useful to think of it as a 'character set'.

Unicode is a long-term effort to enumerate the characters used in a very large range of world writing systems (strictly, a 'glyph' is the shape a font draws for a character, and Unicode catalogues the characters rather than the shapes). In Unicode, each of these characters is given a number – a 'codepoint' – which identifies it. Thus the character 'a' (the Latin lowercase letter 'a') is given codepoint number 97 (it's no coincidence that the codepoints of the first 128 characters are identical to their numbers in ASCII).

Thus a 'Unicode string' is a sequence of Unicode codepoints. These are abstract integers.

If you want to actually serialise this sequence of codepoints to a file, or through a network, then you have to encode it as a sequence of bytes. That's what an 'encoding' is.
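A minimal Python sketch of that distinction (the sample string is just an illustration): `ord` exposes each character's codepoint as an integer, and `str.encode` serialises the string to bytes using a named encoding.

```python
text = "aé€"

# The string viewed as a sequence of abstract integers (codepoints).
code_points = [ord(ch) for ch in text]

# The same string serialised to bytes with one particular encoding.
utf8_bytes = text.encode("utf-8")

print(code_points)          # [97, 233, 8364]
print(utf8_bytes.hex(" "))  # 61 c3 a9 e2 82 ac
```

Three codepoints become six bytes here, because UTF-8 uses one byte for 'a', two for 'é' and three for '€'.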

UTF-8 is one of a few standard recipes for doing this encoding; UTF-16 and UTF-32 are two other standard ones, and UCS-2 is a now-deprecated one. UTF-8 is a method which takes a sequence of integers (these codepoints) and turns it into a sequence of bytes. The Wikipedia page on UTF-8 is pretty clear, I think.
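To make "a method which takes a sequence of integers and turns it into a sequence of bytes" concrete, here is a hand-written sketch of the UTF-8 bit layout in Python. The helper names `utf8_encode_code_point` and `utf8_encode` are hypothetical and exist only for illustration; in practice you would call the built-in `str.encode("utf-8")`, which the sketch is checked against.

```python
def utf8_encode_code_point(cp: int) -> bytes:
    """Encode one codepoint as UTF-8 (illustrative helper, not a library API)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable codepoint: {cp:#x}")
    if cp < 0x80:       # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:      # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:    # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def utf8_encode(code_points) -> bytes:
    """Encode an iterable of codepoints as one UTF-8 byte string."""
    return b"".join(utf8_encode_code_point(cp) for cp in code_points)


sample = "a\u00e9\u20ac\U0001F600"  # 'a', 'é', '€' and an emoji

# The sketch agrees with Python's built-in encoder.
print(utf8_encode(map(ord, sample)) == sample.encode("utf-8"))  # True

# The same four codepoints under the other standard encodings.
encoded_sizes = {name: len(sample.encode(name))
                 for name in ("utf-8", "utf-16-le", "utf-32-le")}
print(encoded_sizes)  # {'utf-8': 10, 'utf-16-le': 10, 'utf-32-le': 16}
```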

Joel Spolsky has an excellent summary called The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) which is... well... what it says.

(Terminology: a 'character set' or 'codepage' is something like ASCII or ISO-8859-n (e.g. Latin-1, ISO-8859-1), which is a fixed-size table that associates a number with a character. This idea obviously overlaps somewhat with the idea of Unicode's 'list of all characters', and the fact that Unicode is sometimes described as a 'Universal Character Set' helps blur that distinction. However, Unicode's clear distinction between the abstract list of integers which is a 'Unicode string', and its encoding into the sequence of bytes which appears on disk, is a very valuable one. When you've had the 'Aha!' moment and seen why that's a very useful idea, Unicode suddenly becomes very simple and obvious.)

Upvotes: 2

bobince

Reputation: 536429

<meta charset="foo"> is a mostly-compatible-by-luck abbreviation of the original HTML 2.0 <meta http-equiv="Content-Type" content="text/html; charset=foo"> construct. meta http-equiv is used (in a limited way) to smuggle HTTP headers inside an HTML document, so this construct is equivalent to setting charset=foo on the Content-Type header of the enclosing HTTP response.

The Content-Type HTTP header was taken from the MIME standard originally used for e-mail (RFC 2045, originally RFC 1341). This standard called it charset because it predates Unicode. In those days, ISO-8859-1, cp1251 et al. were considered separate character sets. It was only when Unicode came along that they were reformulated as encoded subsets of the One True character set.
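Because the header syntax comes from MIME, Python's standard `email` package can parse it. The following is only a sketch of where the `charset` parameter lives, using a made-up header value; it is not how a browser processes a page.

```python
from email.message import Message

# Build a bare MIME message carrying only the header of interest.
header = Message()
header["Content-Type"] = "text/html; charset=utf-8"

print(header.get_content_type())     # text/html
print(header.get_content_charset())  # utf-8
```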

Now that the web has standardised on Unicode (actually UTF-16 code units, more's the pity) as its character model it would indeed be more accurate to describe it as an encoding. But the name charset has stuck because there is no pressing need to fix it.

Upvotes: 5

deceze

Reputation: 522165

There used to be no distinction between those two. For example, ASCII defines certain bytes to represent certain letters. It can be called both an encoding and a character set. Or a "codepage" for that matter. An "encoding" defines how certain characters are encoded in bytes. A "charset" is a set of characters that can be represented by a computer [using a specific method]. A "codepage" is a "page" of codes that map to characters. All three terms essentially mean the same thing.

Only Unicode introduced an indirection between its "set of characters" and the physical encoding they're represented in. The same is not true for most other encodings/charsets/codepages.
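A small Python sketch of that indirection, using 'é' as an arbitrary example: in Latin-1 the character's number and its byte are the same value, while Unicode keeps one codepoint that different encodings turn into different bytes.

```python
ch = "é"

print(ord(ch))                 # 233: the Unicode codepoint (0xE9)
print(ch.encode("latin-1"))    # b'\xe9': Latin-1 stores the number as the byte
print(ch.encode("utf-8"))      # b'\xc3\xa9'
print(ch.encode("utf-16-be"))  # b'\x00\xe9'
```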

They had to choose some term when they created HTML. They went with charset. It doesn't have any more or less meaning than if they'd chosen encoding.

Upvotes: 1

Daniel Samson

Reputation: 718

"Character encoding is used to represent a repertoire of characters by some kind of an encoding system." - Wikipedia.

UTF-8 is a character set. It defines which binary values represent a character in an encoding system. E.g. in UTF-8 a = 01100001. Without a declared charset, the web browser / server may choose to use a different value for the letter a, which can lead to all sorts of issues.
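One of those issues can be sketched in Python: the bytes below are written as UTF-8 but then read back as if they were Latin-1, which is roughly what happens when the declared and actual encodings disagree. The word "café" is just a sample.

```python
data = "café".encode("utf-8")  # bytes as the author saved them

print(data.decode("utf-8"))    # café   (decoded with the right encoding)
print(data.decode("latin-1"))  # cafÃ©  (decoded with the wrong one)
```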

In an HTML5 document you should put this inside the <head> tag:

<meta charset="utf-8">

In an HTML 4.01 document you should put this inside the <head> tag:

<meta http-equiv="content-type" content="text/html; charset=UTF-8">

Upvotes: 0
