Martin Thoma

Reputation: 136665

Are there examples of ISO 8859-1 text files which are valid, but different in UTF-8?

I know that UTF-8 supports way more characters than Latin-1 (even with the extensions). But are there examples of files that are valid in both, yet decode to different characters? Essentially, the content would change depending on which encoding you assume for the file.

I also know that a big chunk of Latin-1 maps 1:1 to the same part in UTF-8. The question is: which code points could change the value if interpreted differently (not invalid, but different)?

Upvotes: 3

Views: 1562

Answers (2)

deceze

Reputation: 522500

Latin-1 is a single-byte encoding (1 character = 1 byte) that uses all possible byte values, so any byte maps to something in Latin-1. That means literally any file is "valid" in Latin-1: you can interpret any file as Latin-1 and you'll get… something… as a result.
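As a minimal sketch of the "any byte is valid" claim, the snippet below decodes all 256 byte values with Python's `latin-1` codec. This assumes Python's codec, which also maps the control ranges to the corresponding C0/C1 control characters. The variable names are only illustrative.

```python
# Every one of the 256 possible byte values decodes under Latin-1.
all_bytes = bytes(range(256))
text = all_bytes.decode("latin-1")  # does not raise UnicodeDecodeError

# One character per byte, and re-encoding gives back the same bytes.
print(len(text))
print(text.encode("latin-1") == all_bytes)

# The same byte string is not valid UTF-8 (0x80 cannot start a sequence).
try:
    all_bytes.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print(utf8_ok)
```

So the interesting direction is the one described here: bytes that are valid UTF-8 are always decodable as Latin-1, but not the other way around.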

So yes: take any valid UTF-8 file and interpret it as Latin-1. It's valid in both UTF-8 and Latin-1. The first 128 characters are the same in both encodings, since both are based on ASCII; but if your UTF-8 file uses any non-ASCII characters, those will be interpreted as gibberish (yet valid) Latin-1.

bytes            encoding   text
e6bc a2e5 ad97   UTF-8      漢字
e6bc a2e5 ad97   Latin-1    æ¼¢å­  👈 valid but nonsensical
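The byte example above can be reproduced with a short Python sketch. The variable names are illustrative; the only assumption is Python's built-in `utf-8` and `latin-1` codecs.

```python
raw = bytes.fromhex("e6bca2e5ad97")

as_utf8 = raw.decode("utf-8")      # two CJK characters
as_latin1 = raw.decode("latin-1")  # six characters, one per byte

print(as_utf8)
print(len(as_utf8), len(as_latin1))

# List the code points of the Latin-1 reading, since some are not printable.
print([f"U+{ord(c):04X}" for c in as_latin1])
```

The last two code points of the Latin-1 reading are U+00AD (soft hyphen) and U+0097 (a C1 control character), which is why only four visible characters tend to show up when the text is displayed.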

Upvotes: 5

Codo

Reputation: 78945

Unicode is, somewhat simplified, a character set, and UTF-8 is one of several encodings that give Unicode a binary representation.

ISO-8859-1 is both a character set and an encoding.

At the character set level, ISO-8859-1 is a subset of Unicode, i.e. each ISO-8859-1 character also exists in Unicode, and the ISO-8859-1 code is even equal to the Unicode codepoint.
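The equality of ISO-8859-1 codes and Unicode code points can be checked with a small sketch like the following. It relies on Python's `latin-1` codec; the sample characters are arbitrary picks.

```python
# Compare the Latin-1 byte value with the Unicode code point for a few samples.
for ch in ["A", "\u00e9", "\u00fc", "\u00ff"]:  # A, é, ü, ÿ
    byte_value = ch.encode("latin-1")[0]
    print(ch, hex(byte_value), hex(ord(ch)))

# The same holds across the whole 0..255 range.
same = all(bytes([b]).decode("latin-1") == chr(b) for b in range(256))
print(same)
```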

At the encoding level, ISO-8859-1 and UTF-8 use the same binary representation for the ISO-8859-1 characters 0 to 127. For the characters between 128 and 255 they differ, as UTF-8 needs 2 bytes to represent each of them.

Example:

Word     ISO-8859-1          UTF-8
Zürich   5a fc 72 69 63 68   5a c3 bc 72 69 63 68
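The byte sequences in this example can be reproduced, and cross-decoded, with a sketch along these lines. It assumes Python 3.8+ for `bytes.hex(" ")`, and the variable names are illustrative.

```python
word = "Z\u00fcrich"  # "Zürich" with a precomposed ü (U+00FC)

latin1_bytes = word.encode("latin-1")
utf8_bytes = word.encode("utf-8")
print(latin1_bytes.hex(" "))  # 5a fc 72 69 63 68
print(utf8_bytes.hex(" "))    # 5a c3 bc 72 69 63 68

# Reading the UTF-8 bytes as Latin-1 succeeds but yields different text:
# the two bytes c3 bc become two separate characters.
misread = utf8_bytes.decode("latin-1")
print(misread, len(misread))

# Reading the Latin-1 bytes as UTF-8 fails, because 0xfc is not valid UTF-8.
try:
    latin1_bytes.decode("utf-8")
    latin1_is_valid_utf8 = True
except UnicodeDecodeError:
    latin1_is_valid_utf8 = False
print(latin1_is_valid_utf8)
```

In other words, the UTF-8 bytes are a file that is valid under both encodings but reads differently, while the ISO-8859-1 bytes are simply rejected by a strict UTF-8 decoder.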

Upvotes: 2
