Why does the character encoding of HTML source code need to match the one declared in the meta charset tag?

The title pretty much describes the question. The HTML standard says explicitly that the only acceptable value for the <meta charset> tag is UTF-8, and the Nu Html Checker throws an error if any other value is used. But it also throws another error at the same time:

Internal encoding declaration iso-8859-1 disagrees with the actual encoding of the document (utf-8)

I used the value 'iso-8859-1' in the <meta> charset this time.

If the only value we can use is UTF-8, I just can't understand the need to state this as a rule to follow.

The HTML standard itself doesn't say much about this, and the WHATWG Encoding Standard is too technical for me to read and work out myself, so I need help understanding this.

Upvotes: -1

Answers (1)

AmigoJack

Reputation: 6174

Why does the character encoding of HTML source code need to match the one declared in the meta charset tag?

For consistency. Parsers have several hints for determining the text encoding:
- The HTTP header Content-Type (RFC 9110), using its charset= parameter and one of IANA's registered Character Sets (e.g. UTF-8).
- A (pseudo) BOM at the start of the payload's bytes (e.g. 0xEF 0xBB 0xBF for UTF-8), regardless of whether one should avoid BOMs or not, and regardless of how the remaining bytes must be treated.
- A markup attribute found when treating the bytes as text (this implies the parser already had partial success in treating the bytes as text), such as the <meta charset> declaration this question is about.

When you declare an encoding in the HTML/XML markup, it definitely makes no sense to declare something other than the actual encoding: that would just be lying. What's the benefit of encoding the text in, e.g., UTF-16BE and then saying in the markup that it is UTF-16LE?
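
To make the "lying" concrete, here is a minimal Python sketch of what happens when bytes saved in one encoding are decoded according to a different declaration. The sample strings are purely illustrative and not taken from the question.

```python
# Illustrative sample text; any non-ASCII character shows the effect.
text = "café"

# The bytes as actually saved (UTF-8).
raw = text.encode("utf-8")

# What a parser produces if it trusts a declaration of ISO-8859-1.
misread = raw.decode("iso-8859-1")
print(misread)  # cafÃ©

# The UTF-16BE vs. UTF-16LE case: the byte order is swapped,
# so every code unit turns into a different character.
swapped = "hi".encode("utf-16-be").decode("utf-16-le")
print(swapped == "hi")  # False
```

Note that ISO-8859-1 assigns a character to every byte value, so the wrong decode does not fail; it silently produces wrong text, which is why a checker has to flag the mismatch explicitly.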
The HTML standard says explicitly that the only value acceptable for the <meta charset> tag is UTF-8

That's the HTML5 standard as of today, for both:
- <meta http-equiv="content-type" content= and
- <meta charset=.
But in the past the HTML5.2 specs did not limit it to UTF-8 only. HTML 4.01 and XHTML 1 don't have that limitation either.
the Nu Html Checker throws an error if any other value is used

The HTML5.2 specs even define this to further guide everyone onto using UTF-8 only:

Authors should use UTF-8. Conformance checkers may advise authors against using legacy encodings.
If the only value we can use is UTF-8, I just can't understand the need to state this as a rule to follow.

See it as a fallback to make it extra clear that your text is UTF-8 and should be treated as such. There may be enough (legacy) parsers which aren't perfect, or which would never guess a text encoding at all (but at the same time cannot distinguish HTML5 from HTML 4.01 and XHTML precisely enough to know that only UTF-8 is allowed for the former). The declaration is also a double check for authors on whether they really understand what they type there, giving a hint even to the most careless author, as in "wait - when I declare this document as UTF-8 - have I really saved it as such...? Is my HTTP server really sending the correct headers?".
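
The "have I really saved it as such?" part of that double check can be approximated with a strict decode. This is only a sketch: the helper name is_valid_utf8 is hypothetical, and it checks the bytes alone, not what the HTTP server sends.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the bytes form a well-formed UTF-8 sequence."""
    try:
        raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# A file saved as UTF-8 passes; the same text saved as ISO-8859-1 does not.
print(is_valid_utf8("café".encode("utf-8")))       # True
print(is_valid_utf8("café".encode("iso-8859-1")))  # False
```

Pure ASCII content passes this check whatever the author intended, since ASCII bytes are valid UTF-8; the check only becomes meaningful once the file contains non-ASCII characters.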

You're still free to use other text encodings, but then you can't use HTML5 (or XHTML5). Nobody is stopping you from using either HTML 4.01 or XHTML 1. See also Declaring character encodings in HTML, which gives additional hints on why UTF-8 is the way to go and how the markup declaration makes sense:

The declaration should fit completely within the first 1024 bytes at the start of the file

the HTTP header has a higher precedence than the in-document meta declarations, content authors should always take into account whether the character encoding is already declared in the HTTP header. If it is, the meta element must be set to declare the same encoding.

If you have a UTF-8 byte-order mark (BOM) at the start of your file then modern browsers will use that to determine that the encoding of your page is UTF-8. It has a higher precedence than any other declaration, including the HTTP header.

The XML declaration is only required if the page is not being served as UTF-8

The HTML5 specification forbids the use of the meta element to declare UTF-16, because the values must be ASCII-compatible. Instead you should ensure that you always have a byte-order mark at the very start of a UTF-16 encoded file. In effect, this is the in-document declaration.
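
The precedence described in these excerpts (BOM first, then the HTTP header, then an in-document declaration within the first 1024 bytes) can be sketched roughly as follows. This is a simplified illustration, not the actual sniffing algorithm browsers implement: the function name sniff_encoding, the regular expression, and the returned labels are all assumptions made for the example.

```python
import codecs
import re
from typing import Optional, Tuple

# Loose pattern (an assumption for this sketch): finds "charset=" inside a
# <meta ...> tag, which covers <meta charset=...> and, incidentally, the
# http-equiv form. A real HTML parser is far stricter than this.
META_CHARSET = re.compile(
    rb'<meta[^>]*?charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)',
    re.IGNORECASE,
)

# Byte-order marks checked by this sketch.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_encoding(
    raw: bytes, http_charset: Optional[str] = None
) -> Tuple[Optional[str], str]:
    """Return (encoding, source) using the precedence BOM > HTTP > meta."""
    # 1. A BOM at the very start wins over every other declaration.
    for bom, name in BOMS:
        if raw.startswith(bom):
            return name, "bom"
    # 2. Next comes the charset from the HTTP Content-Type header, if any.
    if http_charset:
        return http_charset.lower(), "http"
    # 3. Finally, look for a declaration inside the first 1024 bytes only.
    match = META_CHARSET.search(raw[:1024])
    if match:
        return match.group(1).decode("ascii").lower(), "meta"
    # Nothing declared; a real parser would fall back to a default here.
    return None, "none"


page = b'<!doctype html><meta charset="iso-8859-1"><title>x</title>'
print(sniff_encoding(page))                       # ('iso-8859-1', 'meta')
print(sniff_encoding(page, "UTF-8"))              # ('utf-8', 'http')
print(sniff_encoding(codecs.BOM_UTF8 + page))     # ('utf-8', 'bom')
```

With three independent sources that can each be consulted, any disagreement between them means at least one of them is wrong, which is exactly what the checker's second error message reports.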

It also links to Choosing & applying a character encoding:

must not use JIS_C6226-1983, JIS_X0212-1990, HZ-GB-2312, JOHAB (Windows code page 1361), encodings based on ISO-2022, or encodings based on EBCDIC ... because ... poses a security threat.

must also not use CESU-8, UTF-7, BOCU-1, or SCSU encodings, since they were never intended for Web content and the HTML5 specification forbids browsers from recognising them.

strongly discourages ... UTF-16, ... UTF-32

Other ... should also be avoided. These include Big5 and EUC-JP ... ISO-8859-8

Upvotes: 1
