Jon Cram

Reputation: 17309

Does HTML5 specify a default character encoding for HTML documents if no character encoding is supplied?

An example HTML document retrieved over HTTP lacks any character encoding declaration.

With regard to HTML5, is a default (for example, UTF-8) assumed as the character encoding? Or is it entirely up to the application reading the HTML document to choose a default?

Upvotes: 22

Views: 9285

Answers (1)

ThiefMaster

Reputation: 318508

The charset is determined using these rules:

  1. User override.
  2. An HTTP "charset" parameter in a "Content-Type" field.
  3. A Byte Order Mark before any other data in the HTML document itself.
  4. A META declaration with a "charset" attribute.
  5. A META declaration with an "http-equiv" attribute set to "Content-Type" and a value set for "charset".
  6. Unspecified heuristic analysis.
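
To make the order concrete, here is a rough Python sketch of the list above. It is not how any particular browser does it: the function name `detect_encoding`, the returned source labels, the regex-based META scan, and the 1024-byte scan window are all assumptions for illustration. A real parser tokenizes the markup instead of using a regex, and step 6 is left to the caller because the heuristic is unspecified.

```python
import re
from email.message import Message

# Byte order marks checked in step 3 (an illustrative subset).
BOMS = (
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xff\xfe", "utf-16-le"),
    (b"\xfe\xff", "utf-16-be"),
)

# Assumption: one loose regex covers both <meta charset="..."> (step 4)
# and <meta http-equiv="Content-Type" content="...; charset=..."> (step 5).
META_RE = re.compile(
    rb'<meta[^>]+?charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE
)


def detect_encoding(body, content_type=None, user_override=None):
    """Return (encoding_label, source), following the order listed above."""
    # 1. User override wins over everything else.
    if user_override:
        return user_override, "override"
    # 2. "charset" parameter of the HTTP Content-Type field.
    if content_type:
        msg = Message()
        msg["Content-Type"] = content_type
        charset = msg.get_param("charset")
        if isinstance(charset, str) and charset:
            return charset, "http"
    # 3. Byte Order Mark before any other data.
    for bom, name in BOMS:
        if body.startswith(bom):
            return name, "bom"
    # 4 and 5. META declarations near the start of the document.
    match = META_RE.search(body[:1024])
    if match:
        return match.group(1).decode("ascii"), "meta"
    # 6. Nothing declared: the caller has to apply its own heuristic.
    return None, "heuristic"
```

For example, `detect_encoding(b"<p>hi</p>")` reports that nothing was declared, which is exactly the situation the question asks about.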

...and then...

  1. Normalize the given character encoding string according to the Charset Alias Matching rules defined in Unicode Technical Standard #22.
  2. Override some problematic encodings, i.e. intentionally treat some encodings as if they were different encodings. The most common override is treating US-ASCII and ISO-8859-1 as Windows-1252, but there are several other encoding overrides listed in this table. As the specification notes, "The requirement to treat certain encodings as other encodings according to the table above is a willful violation of the W3C Character Model specification."
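
A small Python sketch of these two steps, under some assumptions: `normalize_label` and `effective_encoding` are made-up names, the normalization follows my reading of the UTS #22 loose matching rules (drop everything except letters and digits, lowercase, drop zeros not preceded by a digit), and the override table contains only the two encodings named above, not the full table from the specification.

```python
def normalize_label(label):
    """Loose charset alias matching in the style of UTS #22."""
    out = []
    for ch in label:
        # Keep only ASCII letters and digits, lowercased.
        if not (ch.isascii() and ch.isalnum()):
            continue
        ch = ch.lower()
        # Drop a zero unless the character kept just before it is a digit.
        if ch == "0" and not (out and out[-1].isdigit()):
            continue
        out.append(ch)
    return "".join(out)


# Only the two overrides mentioned in the answer; the real table has more.
OVERRIDES = {
    normalize_label("US-ASCII"): "windows-1252",
    normalize_label("ISO-8859-1"): "windows-1252",
}


def effective_encoding(label):
    """Return the encoding to actually decode with for a declared label."""
    return OVERRIDES.get(normalize_label(label), label.strip().lower())


# Byte 0x93 is a curly opening quote in Windows-1252 but a control
# character in ISO-8859-1, which is why the override matters in practice.
text = b"\x93quoted\x94".decode(effective_encoding("ISO-8859-1"))
```

So a page declared as ISO-8859-1 that actually contains Windows-1252 "smart quotes" still decodes to readable text.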

But the most important thing is:

You should always specify a character encoding on every HTML document, or bad things will happen. You can do it the hard way (HTTP Content-Type header), the easy way (<meta http-equiv> declaration), or the new way (<meta charset> attribute), but please do it. The web thanks you.
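
As an illustration of that advice, here is a hedged Python sketch that declares the encoding both in the HTTP header and in the markup. `build_response` is a hypothetical helper, not part of any framework, and UTF-8 is just the encoding chosen for the example.

```python
from html import escape

ENCODING = "utf-8"  # assumption: the encoding this example serves


def build_response(text):
    """Return (headers, body_bytes) with the encoding declared twice."""
    # The "hard way": charset parameter on the HTTP Content-Type header.
    headers = [("Content-Type", f"text/html; charset={ENCODING}")]
    # The "new way": <meta charset> early in <head>. The older form is
    # <meta http-equiv="Content-Type" content="text/html; charset=utf-8">.
    document = (
        "<!DOCTYPE html>\n"
        "<html><head>"
        f'<meta charset="{ENCODING}">'
        "<title>Example</title>"
        "</head><body>"
        f"<p>{escape(text)}</p>"
        "</body></html>"
    )
    # Encode the bytes with the same encoding that was declared.
    return headers, document.encode(ENCODING)
```

Whichever declaration you use, the bytes you send have to actually be in the encoding you declare; the declaration on its own does not convert anything.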


Upvotes: 20
