oldMCdonald

Reputation: 55

Confusion between the encoding of a web document and the encoding explicitly declared in the document

I know it's a very basic question, but unfortunately I couldn't figure it out on my own. I always get confused when it comes to encodings and character sets. I'll explain what I understand of the topic, then ask my questions.

When you save a file, you save it in a certain character encoding, meaning that each character of the file is stored in memory according to that encoding, right?

For example, if an HTML file is encoded in UTF-16, does that mean the browser uses UTF-16 to decode the file and read the source code?

Does the charset attribute of the meta element define which encoding the language (HTML) should use to display characters properly in the browser?

And did HTML add "HTML character references" on its own, so that they have nothing to do with Unicode character codes?

Edit1:

So after @snakecharmerb's answer I realized some of my mistakes:

1- I didn't know that a text file carries no metadata about its own encoding.

2- The charset attribute tells the browser the encoding of the file, because this information can't be reliably determined from the file itself (to some extent it can; see this answer).

3- A text file can only have one encoding, and if a file is encoded with UTF-8, it follows the Unicode character set (the Universal Coded Character Set, UCS). You can't use the UTF-8 encoding with another character set. Today the terms UTF-8 and Unicode are often used almost interchangeably, although strictly speaking Unicode is the character set and UTF-8 is one of its encodings.

Upvotes: 1

Views: 92

Answers (1)

snakecharmerb

Reputation: 55699

When you save a file, you save it in a certain character encoding, meaning that each character of the file is stored in memory according to that encoding, right?

  • yes, each character is encoded as a specific numeric value, stored as a sequence of one or more bytes; decoding converts those bytes back to the character
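As a minimal sketch of that round trip, the snippet below encodes one character under three encodings and decodes it again. The sample character and the variable names are only illustrative choices.

```python
# Encode the same character under three encodings and compare the bytes.
text = "é"  # code point U+00E9

utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")  # little-endian, no byte order mark
latin1_bytes = text.encode("latin-1")

print(utf8_bytes)    # b'\xc3\xa9'
print(utf16_bytes)   # b'\xe9\x00'
print(latin1_bytes)  # b'\xe9'

# Decoding with the same encoding restores the original character.
roundtrip = utf8_bytes.decode("utf-8")
print(roundtrip == text)  # True
```

The same character produces different bytes under each encoding, which is why the reader of a file has to know which one was used.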

For example, if an HTML file is encoded in UTF-16, does that mean the browser uses UTF-16 to decode the file and read the source code?

  • the browser will attempt to decode the page using the encoding provided in the Content-Type header in the response headers from the web server; if the header is missing or does not specify an encoding, the meta charset tag in the page will be used (a byte order mark at the start of the file, if present, takes precedence over both). If neither is specified, the browser may attempt to infer the encoding from the document content, and finally fall back to a default, typically windows-1252, which is what browsers use for pages labelled latin-1

  • the W3C recommends always setting the meta tag, only setting the Content-Type header if you are sure it will be correct, and always using UTF-8 as your encoding.
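The lookup order described above (header first, then meta tag, then a default) can be sketched roughly as follows. This is a simplified illustration, not how any real browser is implemented: `pick_encoding` is a hypothetical helper, the regular expressions are deliberately naive, the content-sniffing step is skipped, and the `windows-1252` default is an assumption.

```python
import re

def pick_encoding(content_type, body, default="windows-1252"):
    """Hypothetical helper: pick an encoding from the header, then the meta tag."""
    # 1. charset parameter of the Content-Type response header, if present
    if content_type:
        match = re.search(r'charset\s*=\s*["\']?([\w.:-]+)', content_type, re.I)
        if match:
            return match.group(1).lower()
    # 2. a meta charset declaration near the start of the raw bytes
    match = re.search(rb'<meta[^>]*charset\s*=\s*["\']?([\w.:-]+)', body[:1024], re.I)
    if match:
        return match.group(1).decode("ascii").lower()
    # 3. nothing declared: use the assumed default
    return default

print(pick_encoding("text/html; charset=UTF-8", b"<html></html>"))  # utf-8
print(pick_encoding("text/html", b'<meta charset="utf-16">'))       # utf-16
print(pick_encoding(None, b"<html></html>"))                        # windows-1252
```

Note that the meta tag has to be found in the raw bytes before the encoding is known, which works because the declaration itself uses only ASCII characters.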

Does the charset attribute of the meta element define which encoding the language (HTML) should use to display characters properly in the browser?

  • it tells the browser which encoding should be used to decode the page; it only describes how the file's bytes were saved and does not convert them
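To see why the declared encoding has to match the one the file was actually saved in, here is a small sketch that decodes the same bytes two ways. The sample text is arbitrary, and `cp1252` is Python's name for windows-1252.

```python
# Bytes as they would be stored for a page saved as UTF-8.
page_bytes = "café".encode("utf-8")

right = page_bytes.decode("utf-8")   # declared encoding matches the file
wrong = page_bytes.decode("cp1252")  # declared encoding does not match

print(right)  # café
print(wrong)  # cafÃ©
```

The bytes are identical in both cases; only the interpretation changes, which is what produces the familiar garbled characters.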

And did HTML add "HTML character references" on its own, so that they have nothing to do with Unicode character codes?

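  • html entities (like &#39; or &apos;) are independent of any particular encoding, but their constituent characters will themselves be encoded and decoded. Numeric references such as &#39; do refer to Unicode code points, so they are not unrelated to Unicode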
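A short sketch using Python's standard `html` module illustrates both points: character references resolve to Unicode characters, and the reference text itself is plain ASCII, so it yields the same bytes in any ASCII-compatible encoding. The specific references are just examples.

```python
import html

# Numeric and named references resolve to the same Unicode character.
apostrophe_numeric = html.unescape("&#39;")
apostrophe_named = html.unescape("&apos;")
print(apostrophe_numeric == apostrophe_named == "'")  # True

# The number in a numeric reference is a Unicode code point.
e_acute = html.unescape("&#233;")
print(e_acute == "\u00e9")  # True

# The reference is written in ASCII, so its bytes do not depend on
# which ASCII-compatible encoding the file is saved in.
reference = "&#233;"
same_bytes = reference.encode("utf-8") == reference.encode("latin-1")
print(same_bytes)  # True
```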

Upvotes: 1
