PHP & HTML5: UTF-8 document declaration with <meta> tag or through the header() function?

I'm trying to optimize the way my framework generates HTML5 pages. Right now, I insert a <meta charset="utf-8"/> right after the <head> tag, so that it's the first element specified (by the time the parser reaches the <title> tag and the rest of the page elements, the document is already declared as UTF-8).

The problem is that I'm reading some books on website performance optimization, and most of them recommend specifying the encoding through a Content-Type declaration rather than inserting a <meta> block.

The W3C documentation on character encoding detection (section 8.2.2.1) says, essentially, that the HTTP headers take priority over any explicit declaration in the document, EXCEPT when the user has declared an override for the content type through the user agent.

However, the W3C validator (which is what I use to debug my HTML output) doesn't complain, but it does warn me about the absence of the <meta charset="utf-8"/> block, thus encouraging me to add it (it says it's especially recommended if the rendered page is to be saved, which is not the case here, but still... it confuses me a bit).

The question is... how can I ensure the pages are ALWAYS specified as encoded in UTF-8? Must I declare the HTTP header AND the <meta> tag or just the HTTP header?

Upvotes: 2

Views: 10581

Answers (1)

hakre

Reputation: 197682

I could not describe it better than: The Road to HTML 5: character encoding

it's a 7-step algorithm; step 4 has 2 sub-steps, the first of which has 7 branches, one of which has 8 sub-steps, one of which actually links to a separate algorithm that itself has 7 steps... It goes on like that for a while. The gist of it is

  • User override. - You have no influence on this
  • An HTTP "charset" parameter in a "Content-Type" field. In PHP code that is:

    header('Content-Type: text/html;charset=UTF-8');
    
  • A Byte Order Mark before any other data in the HTML document itself. - I cannot recommend actually using this feature. If you like, just save your files with a BOM, but do not expect header() calls to work flawlessly any longer: a BOM at the start of a PHP source file is sent as output before your code runs, so (unless output buffering is enabled) the headers have already been sent by then. The alternative is to output the BOM manually; in PHP that is:

    echo "\xEF\xBB\xBF"; # UTF-8 BOM
    

    But even then I cannot recommend outputting a BOM, because it is a backwards-incompatible change to the output. These guidelines are for reading, not for outputting.

  • A META declaration with a "charset" attribute. - Please do so, this is good practice. In HTML 5 that is:

    <meta charset="UTF-8">
    
  • A META declaration with an "http-equiv" attribute set to "Content-Type" and a value set for "charset". - Why not?! In HTML 5 that would be:

    <meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
    
  • Unspecified heuristic analysis. - You have no influence on this.

Those are the points. My recommendations are as follows:

  • Check that your webserver is sending the correct headers when serving the HTML.
  • Also include those meta tags in your HTML, so that the file can be saved to disk and opened later in a browser (offline, archive).
  • Do not put a BOM inside the document if you're using UTF-8.
  • Do not use UTF-16 or UTF-32; if you use Unicode, use UTF-8.
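
The first two recommendations could be sketched in PHP roughly like this. It is only a minimal sketch, assuming the page is generated by PHP itself; the function names contentTypeHeader(), renderPage() and charsetFromContentType() are made up for illustration and are not part of PHP or of any framework.

```php
<?php
/**
 * Builds the header line that declares the charset at the HTTP level.
 * (Hypothetical helper, named for this example only.)
 */
function contentTypeHeader(string $charset = 'UTF-8'): string
{
    return 'Content-Type: text/html; charset=' . $charset;
}

/**
 * Renders a small HTML5 page with the <meta charset> declaration as the
 * first element inside <head>, so a copy saved to disk still carries
 * its encoding. $bodyHtml is assumed to be trusted, ready-made HTML.
 */
function renderPage(string $title, string $bodyHtml, string $charset = 'UTF-8'): string
{
    // Escape the title so it cannot break out of the <title> element.
    $safeTitle = htmlspecialchars($title, ENT_QUOTES | ENT_SUBSTITUTE, $charset);

    return "<!DOCTYPE html>\n"
        . "<html>\n<head>\n"
        . '<meta charset="' . $charset . "\">\n"
        . "<title>{$safeTitle}</title>\n"
        . "</head>\n<body>\n{$bodyHtml}\n</body>\n</html>\n";
}

/**
 * Extracts the charset parameter from a Content-Type header value,
 * e.g. to check what a server actually sent. Returns null if the
 * value carries no charset parameter.
 */
function charsetFromContentType(string $headerValue): ?string
{
    if (preg_match('/;\s*charset\s*=\s*"?([^";\s]+)"?/i', $headerValue, $m) === 1) {
        return strtoupper($m[1]);
    }
    return null;
}
```

In a real script you would call header(contentTypeHeader()) before any output and then echo renderPage(...). To check a live server, you could feed the Content-Type value it returns (for instance from your browser's network inspector) into charsetFromContentType().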

If you are targeting systems that are totally unaware of encodings, use US-ASCII and mask everything that is not part of it as HTML entities.

Note: This entities suggestion is for output to the browser, not for storage. Storage falls in your own area, so make sure you are aware of encodings when you handle your data store. For example, never use HTML entities when you write HTML into your MySQL database unless you really need them (e.g. &amp; in HTML links).
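
For the US-ASCII fallback, one possible way to mask non-ASCII characters at output time is shown below. This is a sketch that assumes the input is UTF-8 text and that the mbstring extension is available; toAsciiHtml() is a hypothetical helper name, not a built-in.

```php
<?php
/**
 * Converts UTF-8 text into pure US-ASCII HTML: markup characters are
 * escaped first, then every code point above 0x7F is replaced by a
 * numeric character reference. Intended for output only, not storage.
 * (Hypothetical helper; requires the mbstring extension.)
 */
function toAsciiHtml(string $utf8Text): string
{
    // Escape &, <, >, and quotes; invalid byte sequences are substituted.
    $escaped = htmlspecialchars($utf8Text, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

    // Map format: [first code point, last code point, offset, mask].
    // This covers everything from U+0080 up to U+10FFFF.
    $convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];

    return mb_encode_numericentity($escaped, $convmap, 'UTF-8');
}
```

The result contains only ASCII bytes, so it displays the same regardless of which encoding the receiving system assumes, as long as that encoding is ASCII-compatible.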

Upvotes: 5
