J Fabian Meier

Reputation: 35785

Encoding agnostic way to read "German" text files

All our text-based files are encoded in UTF-8 or latin-1 (Windows). The only "special characters" we use are the German umlauts ä, ö, ü and the ß.

For different reasons (including historical, but also the old problem of "properties files cannot be UTF-8"), we cannot unify our encoding completely.

This obviously leads to errors when people read a text file in Java and use the wrong encoding.

Is there an easy, reliable way to detect whether a file is UTF-8 or latin-1, given that the only possible special characters are the ones indicated above?

Or do I need to read the file as a byte array and search for special bytes?

Upvotes: 5

Views: 2224

Answers (1)

assylias

Reputation: 328568

If the only non-ASCII characters are ä, ö, ü and ß, then you can use the fact that the first byte of their UTF-8 encoding is always 195 (-61 as a signed Java byte). Byte 195 is Ã in ISO 8859-1, a character which apparently you don't expect to find in your files.
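You can verify that claim directly (a quick check, not part of the original answer): each of the four German special characters encodes to two bytes in UTF-8, the first being -61 (0xC3), while ISO 8859-1 uses a single byte per character.

```java
import java.nio.charset.StandardCharsets;

public class UmlautBytes {
    public static void main(String[] args) {
        for (String s : new String[] {"ä", "ö", "ü", "ß"}) {
            // UTF-8: two bytes, the first always 0xC3 (-61 signed)
            byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
            // ISO 8859-1: a single byte
            byte[] latin1 = s.getBytes(StandardCharsets.ISO_8859_1);
            System.out.println(s + ": UTF-8 first byte " + utf8[0]
                    + " (length " + utf8.length
                    + "), ISO 8859-1 length " + latin1.length);
        }
    }
}
```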

So a solution could be something like this:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public static String readFile(Path p) throws IOException {
  byte[] bytes = Files.readAllBytes(p);
  boolean isUtf8 = false;
  for (byte b : bytes) {
    if (b == -61) { // 0xC3: first byte of ä, ö, ü, ß in UTF-8
      isUtf8 = true;
      break;
    }
  }
  return new String(bytes, isUtf8 ? StandardCharsets.UTF_8 : StandardCharsets.ISO_8859_1);
}

This is of course quite fragile and won't work if the file contains other special characters.
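A somewhat more defensive variant (my sketch, not part of the original answer) is to attempt a strict UTF-8 decode of the whole file and fall back to latin-1 only if decoding fails, since ISO 8859-1 accepts any byte sequence:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class EncodingGuess {
    public static String readFile(Path p) throws IOException {
        byte[] bytes = Files.readAllBytes(p);
        try {
            // Strict decoder: reject malformed input instead of replacing it
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8 -> assume latin-1
            return new String(bytes, StandardCharsets.ISO_8859_1);
        }
    }
}
```

Note this has its own failure mode: a latin-1 file whose bytes happen to form valid UTF-8 sequences will be decoded as UTF-8, so it is still a heuristic, not a guarantee.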

Upvotes: 2
