Reputation: 130
I ran the following command in a software repository I have access to:
find . -not -name ".svn" -type f -exec file "{}" \;
and saw many output lines like
./File.java: ISO-8859 C++ program text
What does that mean? ISO-8859 is a family of encodings, not a specific encoding. I expected all files to be UTF-8, but most are reported with the encoding shown above. Is ISO-8859 a proper subset of UTF-8, too?
Can I safely convert all those files by using ISO-8859-1 as the source encoding and translating them into UTF-8, with iconv for example?
Upvotes: 1
Views: 5838
Reputation: 91025
The charset detection used by file is rather simplistic. It recognizes UTF-8, and it distinguishes between "ISO-8859" and "non-ISO extended-ASCII" by looking for bytes in the 0x80-0x9F range, where the ISO 8859 encodings have "holes". But it makes no attempt to determine which ISO 8859 encoding is in use, which is why it just says ISO-8859 instead of ISO-8859-1 or ISO-8859-15.
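To make the heuristic concrete, here is a rough Python sketch of the decision described above. It is not the actual source of file, only an approximation; the function name classify and the exact order of the checks are my own assumptions.

```python
def classify(data: bytes) -> str:
    """Approximate the kind of guess described above (not file's real code)."""
    # Pure 7-bit data is plain ASCII.
    if all(b < 0x80 for b in data):
        return "ASCII"
    # Data that decodes cleanly as UTF-8 is reported as UTF-8.
    try:
        data.decode("utf-8")
        return "UTF-8"
    except UnicodeDecodeError:
        pass
    # The ISO 8859 encodings assign no printable characters to 0x80-0x9F,
    # so any byte in that range suggests some other 8-bit encoding.
    if any(0x80 <= b <= 0x9F for b in data):
        return "non-ISO extended-ASCII"
    # Otherwise it is "some ISO 8859 variant"; nothing here can tell which one.
    return "ISO-8859"
```

Note that the last branch never looks at which characters the high bytes would represent, so ISO-8859-1 and ISO-8859-15 text are indistinguishable to it.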
Upvotes: 0
Reputation: 80405
I am afraid that the Unix file program is rather bad at this. The output just means the file is in a byte encoding. It does not mean that it is ISO-8859-1. It might even be in a non-ISO byte encoding, although file usually figures that out.
I have a system that does much better than file, but it is trained on an English-language corpus, so it might not do as well on German.
The short answer is that the result of file is not reliable. You have to know the real encoding to up-convert it.
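As an illustration of why the source encoding matters, here is a small Python sketch of the conversion step. The helper name to_utf8 is hypothetical, and the sketch assumes you have already established the real encoding by other means. Only the ASCII range is shared with UTF-8; a byte above 0x7F in any ISO 8859 encoding is not valid UTF-8 on its own, so such files do need converting.

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Return raw re-encoded as UTF-8, assuming source_encoding is correct."""
    try:
        # Leave data alone if it is already valid UTF-8 (this includes ASCII).
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        # Every byte value decodes under ISO-8859-1, so a wrong guess here
        # raises no error; it silently produces the wrong characters.
        return raw.decode(source_encoding).encode("utf-8")


# The same byte yields different text depending on the assumed encoding:
# 0xA4 is the currency sign in ISO-8859-1 but the euro sign in ISO-8859-15.
as_latin1 = to_utf8(b"\xa4", "iso-8859-1")
as_latin9 = to_utf8(b"\xa4", "iso-8859-15")
```

Because decoding as ISO-8859-1 never fails, a conversion that runs without errors is no evidence that the guess was right.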
Upvotes: 1