Dervin Thunk

Reputation: 20119

Guessing UTF-8 encoding

I have a question that may be quite naive, but I feel the need to ask, because I don't really know what is going on. I'm on Ubuntu.

Suppose I do

echo "t" > test.txt

if I then

file test.txt

I get test.txt: ASCII text

If I then do

echo "å" > test.txt

Then I get

test.txt: UTF-8 Unicode text

How does that happen? How does file "know" the encoding, or, alternatively, how does it guess it?

Thanks.

Upvotes: 4

Views: 1375

Answers (4)

David Z

Reputation: 131580

There are certain byte sequences that suggest that UTF-8 encoding may be in use (see Wikipedia). If file finds one or more of those and doesn't find anything that can't occur in UTF-8, it's a fair guess that the file is encoded in UTF-8. But again, just a guess. For the basic ASCII character set (normal characters like 't'), the binary representation is the same in most common encodings (including UTF-8), so if a file contains only basic ASCII characters, file has no way to tell which of the many ASCII-compatible encodings was intended. It just goes with ASCII by default.
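You can see the byte sequences in question directly. As a sketch (assuming a UTF-8 locale, so the shell writes "å" as its UTF-8 bytes), `od` dumps the raw bytes:

```shell
# "å" becomes the two bytes C3 A5: a valid UTF-8 lead byte
# (110xxxxx) followed by a continuation byte (10xxxxxx).
# That pairing is the kind of pattern file looks for.
printf 'å' | od -An -tx1
```

The plain 't' from the first example, by contrast, is the single byte 0x74, which is identical in ASCII, UTF-8, and every ISO-8859 variant.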

The other thing to take note of is that your shell is set to use UTF-8, which is why the file gets written in UTF-8 in the first place. Conceivably, you could set the shell to use another encoding like UTF-16, and then the command

echo "å" > test.txt

would write a file using UTF-16.

Upvotes: 5

Isaac

Reputation: 2412

It inserts a BOM at the very beginning of the file.

A BOM (Byte-Order Mark) tells editors the encoding of the file (and other things, like big-/little-endian byte order).

You can find out whether there is a BOM by checking the file size: it's more than 2 bytes (I guess it's 4 or 5 bytes).

This Wikipedia article about BOMs can help a lot.


Update:

Yes, I was wrong.

There is a BOM for UTF-8, but most editors do NOT insert it at the beginning, because the BOM bytes are not ASCII-compatible and one of the goals of UTF-8's design is ASCII compatibility. So it's really bad to insert a BOM in UTF-8!
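For what it's worth, the UTF-8 BOM is the three bytes EF BB BF, and prepending it does change what file reports (hypothetical filenames):

```shell
# Write the same ASCII text with and without a UTF-8 BOM (EF BB BF).
printf '\357\273\277hello\n' > with_bom.txt
printf 'hello\n' > without_bom.txt

# with_bom.txt is exactly 3 bytes larger -- that's the BOM --
# and file flags it explicitly.
file with_bom.txt without_bom.txt
```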

So editors really do guess whether files are encoded in UTF-8 or not.


So Another Question!:

It seems there is a possibility that editors guess wrong about the real encoding of a file. Are such situations rare? Clearly, smaller texts have a higher chance of being misidentified.

Upvotes: 3

Artelius

Reputation: 49089

UTF-8 is "ASCII-friendly", in the sense that a text file consisting only of ASCII characters will be exactly the same, whether it is encoded with ASCII or UTF-8.

Note: some people think there are 256 ASCII characters. There are only 128. ISO-8859-x is a family of encodings whose first 128 characters are ASCII and whose upper 128 are other characters.

Also, UTF-8 is very well designed and gives you several useful properties. For instance, some characters are encoded with 1 byte, some with 2, 3, or 4, but a 4-byte character will never contain the byte sequence of any shorter character, nor will a 3- or 2-byte character. All 1-byte characters use bytes 0 to 127, while all longer characters are encoded as sequences of bytes in the range 128 to 255.

A non-UTF-8 byte stream (for instance, a binary file, or a UTF-16 file) usually can be ruled out as UTF-8, because it is likely to violate such properties. The only exception is plain ASCII files which of course can be harmlessly interpreted as UTF-8 anyway.
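A small illustration of that last point (filenames are made up): a bare continuation byte is illegal UTF-8, so file has to fall back to a legacy 8-bit guess:

```shell
# C3 A5 is the valid UTF-8 encoding of "å";
# a lone A5 with no lead byte is not valid UTF-8.
printf '\303\245\n' > valid.txt
printf '\245\n' > invalid.txt

# file identifies the first as UTF-8 but not the second
# (typically it guesses ISO-8859 instead).
file -b --mime-encoding valid.txt invalid.txt
```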

So in short, UTF-8 files can be detected as such because most "random" byte sequences are illegal in UTF-8, and so something that doesn't violate any rules is quite likely to be UTF-8.

Upvotes: 3

schnaader

Reputation: 49719

From the file manpage:

If a file does not match any of the entries in the magic file, it is examined to see if it seems to be a text file. ASCII, ISO-8859-x, non-ISO 8-bit extended-ASCII character sets (such as those used on Macintosh and IBM PC systems), UTF-8-encoded Unicode, UTF-16-encoded Unicode, and EBCDIC character sets can be distinguished by the different ranges and sequences of bytes that constitute printable text in each set. If a file passes any of these tests, its character set is reported. ASCII, ISO-8859-x, UTF-8, and extended-ASCII files are identified as ''text'' because they will be mostly readable on nearly any terminal; UTF-16 and EBCDIC are only ''character data'' because, while they contain text, it is text that will require translation before it can be read. In addition, file will attempt to determine other characteristics of text-type files. If the lines of a file are terminated by CR, CRLF, or NEL, instead of the Unix-standard LF, this will be reported. Files that contain embedded escape sequences or overstriking will also be identified.
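The line-terminator detection mentioned at the end of that passage is easy to see (hypothetical filename):

```shell
# DOS-style CRLF line endings are reported
# alongside the character set.
printf 'hello\r\n' > crlf.txt
file crlf.txt
```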

Upvotes: 4
