Carlitos_30

Reputation: 370

Latin ISO recognizes a character but UTF-8 does not in an HTML document

I have the following code:

<html>
<head>
    <meta charset="utf-8">
</head>

<body>
    <p>Schrödinger
</body>

When I run it in my browser I get:

Schr�dinger

When I change the encoding to latin ISO:

<html>
<head>
    <meta charset="ISO-8859-1">
</head>

<body>
    <p>Schrödinger
</body>

It works well:

Schrödinger

Curiously, using the code snippet tool on this site, utf-8 works well:

<html>
	<head>
		<meta charset="utf-8">
	</head>

	<body>
		<p>Schrödinger
	</body>
</html>

Using UTF8 should work even better than Latin ISO (it supports more characters).

What can the problem be?

I tested in both Chrome and Firefox. I am using Windows 7 on an old PC.

Upvotes: 1

Views: 1271

Answers (3)

andrewJames

Reputation: 21993

Here is a slightly different approach from the other answers, using a hands-on demonstration to recreate the issue, and then fix it.

(my example uses Notepad++).

1) Create a new text file, and before adding any data or saving it, change the encoding to ANSI (menu: Encoding > ANSI). This assumes UTF-8 is the default.

2) Enter the following text and save as "cat.htm".

<html>
  <head>
    <meta charset="UTF-8">
  </head>
  <body>
    <div>Schrödinger</div>
  </body>
</html>

3) Open the file with Firefox, Chrome, etc.

You will see Schr�dinger.

If you take the above example and change the file's encoding back to UTF-8 in Notepad++ (and reinstate the ö) then you get the expected output: Schrödinger. So, yes, it's all about how the source file was saved - the binary representation.
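
The same mismatch can be reproduced without an editor or a browser. The following is only a minimal Python sketch, assuming that decoding with errors="replace" is a fair stand-in for how a browser substitutes � for bytes it cannot decode; the variable names are illustrative.

```python
text = "Schrödinger"

# The bytes an editor would write to disk under each encoding.
latin1_bytes = text.encode("iso-8859-1")  # what "ANSI" roughly means on a Western Windows setup
utf8_bytes = text.encode("utf-8")

# File saved as Latin-1 but read as UTF-8: the byte for "ö" is not valid
# UTF-8, so it is replaced with U+FFFD (the � character).
mismatched = latin1_bytes.decode("utf-8", errors="replace")

# File saved as UTF-8 and read as UTF-8: everything decodes cleanly.
matched = utf8_bytes.decode("utf-8")

print(mismatched)
print(matched)
```

The first line printed contains the replacement character in place of ö, and the second is the original text, which mirrors steps 1 to 3 above followed by the fix.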

Upvotes: 2

IMSoP

Reputation: 97858

You are right that UTF-8 can represent more characters than ISO-8859-1, but it also represents the same characters differently.

To understand what that means, you need to think about the binary representation that a computer uses for text. When you save some text to a file, what you are actually doing is writing some sequence of ones and zeroes to disk; when you load that file in a web browser, it has to look at that sequence of ones and zeroes and decide what to display.

A character encoding is the way that the browser decides what to display for each sequence of ones and zeroes.

In ISO-8859-1, the character "ö" is written as the sequence 11110110. In UTF-8, that same character would instead be written 1100001110110110, and 11110110 would mean something else (in fact, because of the way UTF-8 works, it looks like the first part of a longer multi-byte sequence, so it can't be displayed on its own).

Your file contains 11110110, so the correct thing to tell the browser is "read this as ISO 8859-1 please". Alternatively, you can open the file in an editor that "knows" both encodings, and tell that editor to rewrite it as UTF-8, so the character will be saved as 1100001110110110 instead.

This is what happens when you paste the character here: your browser knows that Stack Overflow wants the UTF-8 version, and converts it to 1100001110110110 for you.
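
To see the two bit sequences for yourself, here is a small Python sketch. The helper name bits is made up for this illustration; it just prints each byte as eight binary digits.

```python
def bits(data: bytes) -> str:
    """Render each byte as eight binary digits, concatenated."""
    return "".join(f"{byte:08b}" for byte in data)


latin1_bits = bits("ö".encode("iso-8859-1"))  # one byte: 0xF6
utf8_bits = bits("ö".encode("utf-8"))         # two bytes: 0xC3 0xB6

print(latin1_bits)  # 11110110
print(utf8_bits)    # 1100001110110110
```

The single ISO-8859-1 byte and the two UTF-8 bytes are different sequences for the same character, which is why the declared charset has to match the bytes actually in the file.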

Upvotes: 4

Joffrey Schmitz

Reputation: 2438

The encoding is basically how the data is written in binary. The same character (e.g. ö) has a different binary representation depending on the charset: if your file is written in Latin-1 and you declare your charset as Latin-1, the browser will decode it fine. If your file is written in UTF-8 and you declare your charset as UTF-8, the browser will decode it fine. But if you "lie" to the browser by telling it your file is in UTF-8 while it is actually encoded in Latin-1, it will be unable to decode some characters correctly.

Basic ASCII characters usually have the same binary representation whatever the encoding, so they are generally fine, but with accented characters it matters that you declare the correct encoding.

You must declare the charset that matches how the file was actually written; the declaration is a description of the file, not a wish for which character set you would like to use.
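
If you are not sure how a file was saved, a rough check is to try decoding its bytes as UTF-8. The Python sketch below does that and, if needed, rewrites the file as UTF-8. The function names and the page.html file are hypothetical, and the conversion assumes the file really is ISO-8859-1; it cannot detect other legacy encodings.

```python
import tempfile
from pathlib import Path


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def convert_latin1_to_utf8(path: Path) -> None:
    """Rewrite a file ASSUMED to be ISO-8859-1 as UTF-8, in place."""
    text = path.read_bytes().decode("iso-8859-1")
    path.write_bytes(text.encode("utf-8"))


# Demo on a throwaway file that declares utf-8 but was saved as Latin-1.
with tempfile.TemporaryDirectory() as tmp:
    page = Path(tmp) / "page.html"
    page.write_bytes('<meta charset="utf-8"><p>Schrödinger'.encode("iso-8859-1"))

    valid_before = is_valid_utf8(page.read_bytes())
    convert_latin1_to_utf8(page)
    valid_after = is_valid_utf8(page.read_bytes())
    converted_text = page.read_bytes().decode("utf-8")

print(valid_before, valid_after)
```

Note that a file containing only ASCII characters passes this check under either encoding, so the check only tells you something when accented or other non-ASCII characters are present.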

Upvotes: 2
