user1139666

Reputation: 1697

JavaScript character encoding + Internet Explorer 9 encoding

I have noticed strange things while performing tests.
The "strange things" concern character encoding.

For each test I loaded an HTML page in Internet Explorer 9.
The HTML page is encoded in UTF-8.
Here is the code of the page:

<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Insert title here</title>
<script type="text/javascript">
    var strChaine = "été";
    alert(strChaine.charCodeAt(0) +
            " " + strChaine.charCodeAt(1) +
            " " + strChaine.charCodeAt(2) +
            " " + strChaine.charCodeAt(3) +
            " " + strChaine.charCodeAt(4));
</script>
</head>
<body>

</body>
</html>

The HTML page contains JavaScript code to display an alert box.

Before each test I set a specific encoding in IE9 by right-clicking the page and selecting an option in the encoding menu.

Test 1

For this test the IE9 encoding was set to UTF-8.
The alert box displayed: 233 116 233 NaN NaN

This seems strange to me.
Since my HTML page is encoded in UTF-8 and IE9 decodes it as UTF-8, I expected the alert box to display: 195 169 116 195 169
195 169 116 195 169 is the decimal representation of the UTF-8 bytes of the string "été".
0xC3 0xA9 0x74 0xC3 0xA9 is the equivalent hexadecimal representation.

Could anyone explain what the alert box actually displayed?

Test 2

For this test the IE9 encoding was set to Occidental alphabet (ISO).
The alert box displayed: 195 169 116 195 169

Again, this seems strange to me.
I got the result I expected for Test 1.

Could anyone explain what the alert box displayed?

Upvotes: 0

Views: 1592

Answers (1)

Jukka K. Korpela

Reputation: 201588

The string "été" contains three characters, with the Unicode code numbers that your script displays. This does not depend on the character encoding. JavaScript code works on characters or, to put it more exactly, on Unicode code units, not on the bytes that were used to represent the character.
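To make the distinction concrete, here is a minimal sketch that compares the code units reported by charCodeAt with the UTF-8 bytes of the same string. It assumes a runtime that provides TextEncoder (current browsers or Node.js; IE9 does not have it), and the variable names codeUnits and utf8Bytes are only illustrative.

```javascript
// "été" written with escapes, so the encoding of this file cannot affect the result
var strChaine = "\u00e9t\u00e9";

// Code units as seen by JavaScript; indexes past the end of the string yield NaN
var codeUnits = [];
for (var i = 0; i < 5; i++) {
    codeUnits.push(strChaine.charCodeAt(i));
}

// Bytes of the UTF-8 representation (assumes TextEncoder is available)
var utf8Bytes = Array.from(new TextEncoder().encode(strChaine));

console.log(strChaine.length);     // 3
console.log(codeUnits.join(" "));  // 233 116 233 NaN NaN
console.log(utf8Bytes.join(" "));  // 195 169 116 195 169
```

The first list matches the output of Test 1: three code units, then NaN for the two indexes that do not exist. The second list is the byte sequence the question expected, which only exists in the file on disk, not in the string once it has been decoded.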

If the actual encoding is UTF-8 and you make a browser treat it as being in some 8-bit encoding, which is what you probably mean by “Occidental alphabet (ISO),” then the browser misinterprets the octets of the UTF-8 representation as if each of them represented a character.
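The effect can be simulated with a short sketch. It assumes the 8-bit encoding is ISO-8859-1, where each byte value maps to the code point with the same number (for these particular bytes, windows-1252 gives the same result). The names rawBytes, misread and codes are hypothetical.

```javascript
// UTF-8 bytes of "été" as stored in the file
var rawBytes = [0xC3, 0xA9, 0x74, 0xC3, 0xA9];

// Assumption: ISO-8859-1 decoding, so every byte becomes one character
// whose code point equals the byte value
var misread = String.fromCharCode.apply(null, rawBytes);

// Collect the code units the script would now see
var codes = [];
for (var j = 0; j < misread.length; j++) {
    codes.push(misread.charCodeAt(j));
}

console.log(misread.length);   // 5
console.log(codes.join(" "));  // 195 169 116 195 169
```

Under that assumption the string has five characters instead of three, and charCodeAt reports the byte values themselves, which is the output seen in Test 2.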

Upvotes: 2
