Reputation: 328614
A common mistake when writing code that reads text from a stream in Java is to forget to specify the encoding. If you don't specify anything, Java will use the platform default encoding, which eventually causes problems ("But it works on my computer!").
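To illustrate, a minimal sketch of the pattern I mean (the factory class here is made up; StandardCharsets needs Java 7 or later):

```java
import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

class ReaderFactory {
    static BufferedReader open(InputStream in) {
        // Broken: no charset given, so the platform default is used.
        // return new BufferedReader(new InputStreamReader(in));

        // Fixed: the encoding is stated explicitly.
        return new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
    }
}
```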
To find these problems, I want to use an uncommon default encoding that should break as many I/O operations as possible. The idea is that at least any character outside ASCII will be mangled.
Most of our documents use UTF-8 encoding. ISO-8859-1 might work because it simply preserves the input (it's a 1:1 mapping between bytes and characters). Any umlauts would be read as two- or three-byte sequences. But I'm wondering if we could do better.
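For example, this is what happens to a UTF-8 umlaut when it is decoded as ISO-8859-1; you get visible mojibake instead of an error:

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        // "ü" is the two bytes 0xC3 0xBC in UTF-8; ISO-8859-1 maps each
        // byte to one character, so the result is "Ã¼" rather than a failure.
        byte[] utf8 = "ü".getBytes(StandardCharsets.UTF_8);
        System.out.println(new String(utf8, StandardCharsets.ISO_8859_1)); // prints "Ã¼"
    }
}
```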
Which encoding from the list of supported encodings would you suggest?
Upvotes: 4
Views: 4711
Reputation: 11543
I think any of the 16- or 32-bit UTF encodings would give you a lot of "null" characters, which should break a lot of strings. Also, using one with a BOM (byte order mark) should further "break" the file.
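A minimal sketch of that effect: decoding plain ASCII bytes as UTF-16 pairs the bytes up into bogus characters, so even pure-ASCII text is visibly mangled:

```java
import java.nio.charset.StandardCharsets;

public class Utf16Mangle {
    public static void main(String[] args) {
        byte[] ascii = "Hello".getBytes(StandardCharsets.US_ASCII);
        // 0x48 0x65 becomes U+4865, 0x6C 0x6C becomes U+6C6C, and the odd
        // trailing byte turns into a replacement character.
        System.out.println(new String(ascii, StandardCharsets.UTF_16BE));
    }
}
```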
But I'd guess there are code analysis tools that could check for creating Strings, Readers and Writers with no encoding.
Edit: FindBugs seems to be able to do this: Dm: Reliance on default encoding (DM_DEFAULT_ENCODING)
Upvotes: 1
Reputation: 115338
java.nio.charset.Charset has a method newDecoder() that returns a CharsetDecoder. CharsetDecoder has the methods isAutoDetecting(), isCharsetDetected() and detectedCharset() that seem useful for your task. Unfortunately, all these methods are optional operations.
I think you should take all the available charsets (Charset.availableCharsets()) and first check whether they are autodetecting. So, when you get a new stream, first try the built-in autodetection mechanism with those charsets that implement these optional operations.
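A minimal sketch of that first step, assuming you just want to find out which installed charsets advertise autodetection at all (on the JDKs I have seen, typically only x-JISAutoDetect):

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;

public class ListAutoDetecting {
    public static void main(String[] args) {
        // Most decoders return false here; only a few charsets
        // implement the optional autodetection operations.
        for (Charset cs : Charset.availableCharsets().values()) {
            CharsetDecoder decoder = cs.newDecoder();
            if (decoder.isAutoDetecting()) {
                System.out.println(cs.name());
            }
        }
    }
}
```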
If none of these decoders can detect the charset, you should try to decode the stream (as you explained), applying the other charsets one by one. To optimize the process, try to sort the charsets using the following criteria.
National alphabets first. For example, try Cyrillic charsets before those that deal with Latin alphabets.
Among national alphabets, prefer those with more characters. For example, Japanese and Chinese would be at the beginning of the queue.
The reason for this strategy is that you want to fail as fast as you can. If your text does not contain Japanese characters, you only have to check the first character from your stream to understand that it is not Japanese. But if you use an ASCII-like charset to decode French text, you will probably have to read a lot of characters before you see the first è.
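For the fallback step, a sketch of a fail-fast check, assuming you configure a strict decoder that reports errors instead of silently replacing bad input:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

public class StrictDecode {
    // Returns true if the bytes decode cleanly under the given charset.
    // REPORT makes the decoder throw on the first bad sequence, so a
    // wrong guess fails as early as the input allows.
    static boolean decodesCleanly(byte[] bytes, Charset cs) {
        CharsetDecoder decoder = cs.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

Note that single-byte charsets such as ISO-8859-1 map every byte to a character, so they can never fail this check; they belong at the very end of the queue.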
Upvotes: 1
Reputation: 718886
A default encoding of UTF-16 has a good chance of "mangling" any document that isn't UTF-16.
But I think you are going about this the wrong way. A better way to detect dodgy code that relies on default encodings is to write some custom rules for something like PMD. Just look for code that uses the offending methods and constructors on String, the IO classes and so on.
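For reference, a sketch of the call sites such a rule would have to flag; every one of them silently falls back to the platform default encoding (the method and variable names are made up):

```java
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.OutputStream;
import java.io.PrintStream;

public class DefaultEncodingOffenders {
    void examples(byte[] bytes, String s, File f, OutputStream out) throws IOException {
        String a = new String(bytes);               // platform default
        byte[] b = s.getBytes();                    // platform default
        FileReader r = new FileReader(f);           // platform default
        FileWriter w = new FileWriter(f);           // platform default
        PrintStream p = new PrintStream(out, true); // platform default
        r.close(); w.close(); p.close();
    }
}
```

The explicit-Charset overloads of these methods and constructors are the fixes such a rule would point to.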
(The problem with the "use a weird default encoding" approach is that your testing may not be sufficient to exercise all of the offending code, or it might exercise the code but not detect the mangling.)
Upvotes: 2