Reputation: 160
I have a file whose character encoding is set to ANSI, but I can still paste a UTF-8 character into it. Is the character set defined on a file enforced for the entire file? I am trying to understand how character sets work. Thanks
Upvotes: 0
Views: 72
Reputation: 299275
Files are bytes. They are long sequences of numbers. In most operating systems, that's all they are. There is no "encoding" attached to the file. The file is bytes.
It is up to software to interpret those bytes as having some meaning. For example, there is nothing fundamentally different between a "picture file" and a "text file." Both are just long sequences of numbers. But software interprets the "picture file" using some encoding rules to create a picture. Similarly, software interprets the "text file" using some encoding rules to create text.
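To make that concrete, here is a minimal Python sketch that decodes the same bytes under two different encoding rules. The byte values and variable names are just illustrative, and `cp1252` is Python's name for Windows-1252, which I'm assuming is what "ANSI" means here.

```python
# Five bytes. Nothing about them says which encoding they are "in".
data = bytes([0x63, 0x61, 0x66, 0xC3, 0xA9])

# Interpreted as UTF-8, the last two bytes together form one character.
as_utf8 = data.decode("utf-8")

# Interpreted as Windows-1252, every byte is its own character.
as_cp1252 = data.decode("cp1252")

print(repr(as_utf8))    # 'café'
print(repr(as_cp1252))  # 'cafÃ©'
```

The bytes never changed; only the rules used to read them did.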
Most text file formats do not include their encoding anywhere in the format. It's up to the software to know or infer what it is. Sometimes the operating system assists here and provides additional metadata that's not in the file, like filename extensions. This generally doesn't help for text files, since in most systems text files do not have different extensions based on their encoding.
Many characters are encoded identically in ANSI and UTF-8 (the whole ASCII range, for example). So just by looking at a file, it may be impossible to tell which encoding it was written with, since the bytes could be identical in both. There are byte sequences that are illegal in UTF-8, so it is possible to determine that a file is not valid UTF-8, but almost every byte sequence is valid ANSI (though some byte sequences are very rare in real text, and so can be used to guess that a file is not ANSI).
(I assume you mean Windows-1252; there isn't really such a thing as "ANSI" encoding.)
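As a rough sketch of both points in Python (again assuming "ANSI" means Windows-1252, `cp1252` in Python; the helper name `could_be_utf8` is made up for this example):

```python
def could_be_utf8(data: bytes) -> bool:
    """Return True if the bytes form a valid UTF-8 sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# Plain ASCII text produces identical bytes in both encodings,
# so the file alone cannot tell you which one was used.
ascii_cp1252 = "hello".encode("cp1252")
ascii_utf8 = "hello".encode("utf-8")
same_bytes = ascii_cp1252 == ascii_utf8

# An accented character saved as Windows-1252 is a single byte (0xE9).
# On its own that byte is illegal in UTF-8, so this cannot be UTF-8.
accented_cp1252 = "café".encode("cp1252")

print(same_bytes)                      # True
print(could_be_utf8(ascii_cp1252))     # True
print(could_be_utf8(accented_cp1252))  # False
```

Note that a `True` result only means the bytes could be UTF-8. It does not prove they were written that way.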
Upvotes: 3