Reputation: 136665
I know that UTF-8 supports far more characters than Latin-1 (even with the extensions). But are there files that are valid in both encodings yet decode to different characters, so that the content changes depending on which encoding you assume?
I also know that a big chunk of Latin-1 maps 1:1 to the same part in UTF-8. The question is: which code points could change the value if interpreted differently (not invalid, but different)?
Upvotes: 3
Views: 1562
Reputation: 522500
Latin-1 is a single-byte encoding (meaning 1 character = 1 byte), which uses all possible byte values. So any byte maps to something in Latin-1. So literally any file is "valid" in Latin-1. So you can interpret any file as Latin-1 and you'll get… something… as a result.
So yes: take any valid UTF-8 file and interpret it as Latin-1. It is valid in both UTF-8 and Latin-1. The first 128 characters are the same in both encodings, because both are based on ASCII; but if your UTF-8 file uses any non-ASCII characters, those will be interpreted as gibberish (yet valid) Latin-1.
bytes | encoding | text |
---|---|---|
e6bc a2e5 ad97 | UTF-8 | 漢字 |
e6bc a2e5 ad97 | Latin-1 | æ¼¢å 👈 valid but nonsensical (followed by an invisible soft hyphen and a control character) |
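To reproduce the table, here is a minimal Python sketch. It assumes Python's built-in `utf-8` and `latin-1` codecs; the variable names are only illustrative.

```python
# The six bytes from the table above.
raw = bytes.fromhex("e6bca2e5ad97")

as_utf8 = raw.decode("utf-8")      # two characters, three bytes each
as_latin1 = raw.decode("latin-1")  # six characters, one per byte

print(as_utf8, len(as_utf8))
print(repr(as_latin1), len(as_latin1))

# Decoding as Latin-1 cannot fail: Python's latin-1 codec maps
# every byte value 0x00..0xff to exactly one character.
all_bytes = bytes(range(256))
assert len(all_bytes.decode("latin-1")) == 256
```

The same bytes come out as two characters under UTF-8 and as six under Latin-1. The last two Latin-1 characters (0xad and 0x97) are not printable, which is why `repr` is used for the second line.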
Upvotes: 5
Reputation: 78945
Unicode is, somewhat simplified, a character set, and UTF-8 is one of several encodings for the binary representation of Unicode.
ISO-8859-1 is both a character set and an encoding.
At the character set level, ISO-8859-1 is a subset of Unicode, i.e. each ISO-8859-1 character also exists in Unicode, and the ISO-8859-1 code is even equal to the Unicode code point.
At the encoding level, ISO-8859-1 and UTF-8 use the same binary representation for the characters 0 to 127. For the characters between 128 and 255 they differ, because UTF-8 needs 2 bytes to represent each of them.
Example:
Word | ISO-8859-1 | UTF-8 |
---|---|---|
Zürich | 5a fc 72 69 63 68 | 5a c3 bc 72 69 63 68 |
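The byte sequences in the table can be checked with a short Python sketch. It assumes the built-in `latin-1` and `utf-8` codecs and Python 3.8 or later (for the separator argument of `bytes.hex`); the variable names are illustrative only.

```python
word = "Z\u00fcrich"  # "Zürich", written with an escape for the u-umlaut

latin1_bytes = word.encode("latin-1")
utf8_bytes = word.encode("utf-8")

print(latin1_bytes.hex(" "))  # 5a fc 72 69 63 68
print(utf8_bytes.hex(" "))    # 5a c3 bc 72 69 63 68

# Character-set level: the Latin-1 byte value equals the Unicode code point.
assert ord("\u00fc") == 0xFC == latin1_bytes[1]

# Reading the UTF-8 bytes as Latin-1 is valid but yields different text:
# the two bytes c3 bc become two separate characters.
misread = utf8_bytes.decode("latin-1")
print(misread)

# The reverse fails for this word: a lone 0xfc byte is not valid UTF-8.
try:
    latin1_bytes.decode("utf-8")
except UnicodeDecodeError as exc:
    print("not valid UTF-8:", exc.reason)
```

For this word the ambiguity only goes one way. The UTF-8 bytes can be misread as Latin-1 without error, while the Latin-1 bytes are rejected by a UTF-8 decoder.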
Upvotes: 2