Reputation: 35785
All our text-based files are encoded in UTF-8 or latin-1 (Windows). The only "special characters" we use are the German umlauts ä, ö, ü and the ß.
For various reasons (some historical, some the old problem that properties files cannot be UTF-8), we cannot fully unify our encodings.
This obviously leads to errors when people read a text file in Java and use the wrong encoding.
Is there an easy, reliable way to detect whether a file is UTF-8 or latin-1 if you know that the only possible special characters are the ones listed above?
Or do I need to read the file as byte array and search for special bytes?
Upvotes: 5
Views: 2224
Reputation: 328568
If the only non-ASCII characters are ä, ö, ü and ß, then you can use the fact that the first byte of each of their UTF-8 encodings is 195 (-61 as a signed Java byte). Byte 195 is Ã in ISO 8859-1, which apparently you don't expect to find.
So a solution could be something like this:
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public static String readFile(Path p) throws IOException {
    byte[] bytes = Files.readAllBytes(p);
    boolean isUtf8 = false;
    for (byte b : bytes) {
        // 0xC3 (-61 as a signed byte) is the UTF-8 lead byte of ä, ö, ü and ß
        if (b == -61) {
            isUtf8 = true;
            break;
        }
    }
    return new String(bytes, isUtf8 ? StandardCharsets.UTF_8 : StandardCharsets.ISO_8859_1);
}
This is of course quite fragile and won't work if the file contains other special characters.
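A somewhat less fragile alternative is to attempt a strict UTF-8 decode and fall back to ISO 8859-1 only if the bytes are not valid UTF-8. A minimal sketch (the method name readFileStrict is mine):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public static String readFileStrict(Path p) throws IOException {
    byte[] bytes = Files.readAllBytes(p);
    // A decoder that rejects malformed input instead of silently replacing it
    CharsetDecoder utf8 = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
    try {
        return utf8.decode(ByteBuffer.wrap(bytes)).toString();
    } catch (CharacterCodingException e) {
        // Not valid UTF-8, so assume latin-1
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }
}

This works in your situation because a lone latin-1 umlaut byte (e.g. 0xE4 for ä) is never a valid UTF-8 sequence, so any latin-1 file that actually contains one of your special characters will fail the strict decode and take the fallback path.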
Upvotes: 2