KoenigGunther

Reputation: 130

What does ISO-8859 in `file` mean?

I ran the following command in a software repository I have access to:

find . -not -name ".svn" -type f -exec file "{}" \;

and saw many output lines like

./File.java: ISO-8859 C++ program text

What does that mean? ISO-8859 is a family of encodings, not one specific encoding. I expected all files to be UTF-8, but most are reported with the encoding shown above. Is ISO-8859 a proper subset of UTF-8, too?

Can I safely convert all those files to UTF-8, for example with iconv, using ISO-8859-1 as the source encoding?

Upvotes: 1

Views: 5838

Answers (2)

dan04

Reputation: 91025

The charset detection used by `file` is rather simplistic. It recognizes UTF-8, and it distinguishes between "ISO-8859" and "non-ISO extended-ASCII" by looking for bytes in the 0x80-0x9F range, where the ISO 8859 encodings have "holes". But it makes no attempt to determine which ISO 8859 encoding is in use, which is why it just says ISO-8859 instead of ISO-8859-1 or ISO-8859-15.
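To make the idea concrete, here is a rough Python sketch of that kind of heuristic. It is not the actual code of `file`, which handles more cases; the function name `classify` and the order of the checks are my own assumptions for illustration.

```python
def classify(data: bytes) -> str:
    """Simplified imitation of the heuristic described above (hypothetical helper)."""
    # Pure 7-bit content is plain ASCII.
    if all(b < 0x80 for b in data):
        return "ASCII"

    # Valid UTF-8 is recognized by its strict multi-byte structure.
    try:
        data.decode("utf-8")
        return "UTF-8"
    except UnicodeDecodeError:
        pass

    # The ISO 8859 encodings assign no printable characters to 0x80-0x9F,
    # so bytes in that range suggest some other 8-bit encoding.
    if any(0x80 <= b <= 0x9F for b in data):
        return "non-ISO extended-ASCII"

    # Otherwise it is "some ISO 8859 part"; nothing here says which one.
    return "ISO-8859"


print(classify(b"caf\xe9"))            # high bytes only in 0xA0-0xFF
print(classify(b"\x93quoted\x94"))     # bytes in the 0x80-0x9F "hole"
```

Note that the last branch can only ever report the family, since the same bytes are valid in every ISO 8859 part.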

Upvotes: 0

tchrist

Reputation: 80405

I am afraid that the Unix `file` program is rather bad at this. The label just means the file is in some single-byte encoding. It does not mean that it is ISO-8859-1. It might even be in a non-ISO byte encoding, although `file` usually figures that out.

I have a system that does much better than `file`, but it is trained on an English-language corpus, so it might not do as well on German.

The short answer is that the result of file is not reliable. You have to know the real encoding to up-convert it.
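As a small illustration of why the source encoding matters, the following Python sketch converts the same bytes to UTF-8 under two different assumptions. The helper name `to_utf8` and the sample bytes are made up for this example. Byte 0xA4 is the currency sign in ISO-8859-1 but the euro sign in ISO-8859-15, and both conversions succeed without any error.

```python
def to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Re-encode data as UTF-8, trusting the caller's claim about its encoding."""
    return data.decode(source_encoding).encode("utf-8")


raw = b"Preis: 5 \xa4"  # hypothetical file content with one non-ASCII byte

as_latin1 = to_utf8(raw, "iso-8859-1")   # 0xA4 becomes U+00A4 (currency sign)
as_latin9 = to_utf8(raw, "iso-8859-15")  # 0xA4 becomes U+20AC (euro sign)

# Both results are valid UTF-8, but they contain different characters.
print(as_latin1.decode("utf-8"))
print(as_latin9.decode("utf-8"))
print(as_latin1 == as_latin9)
```

So a conversion that assumes ISO-8859-1 will always run to completion, but the output is only right if the files really were ISO-8859-1.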

Upvotes: 1
