craigmiller160

Reputation: 6283

Java: InputStreamReader character encoding needs to be run twice

My company does a lot of work with XML transformations with clients all over the world. As such, we do at times encounter character encoding issues. We have a component of our application which is designed to normalize an InputStream to a specific character encoding. It works well... but with a catch.

In some cases, we need to run it twice. For the life of me I can't tell you why; I've been trying hard to figure out what is causing it, and I've come up with nothing. It just seems that for some files the first run doesn't get the encoding right, but after the second run everything is finally good.

Here is the code that does the encoding (assume that the "encoding" variable is "UTF-8", it usually is):

char[] buffer = new char[getBufferSize()];
String encoding = getEncoding();

Cache fileCache = getFileCache();

try (InputStreamReader reader = new InputStreamReader(data.getDataStream(), encoding); Writer writer = fileCache.getWriter(encoding)) {
    int charsRead;
    while ((charsRead = reader.read(buffer)) != -1) {
        writer.write(buffer, 0, charsRead);
    }
    data.setDataStream(fileCache.getInputStream());
} catch(IOException ex) {
    throw new Exception(String.format("Unable to normalize stream for %s encoding", encoding), ex);
}

So, sometimes this code needs to be run twice before a stream behaves properly with the specified encoding.

I want to make it run better on the first try.

  1. What possible causes could there be for this issue?

  2. Is there any way to improve this code to make the "stream normalization" (as we call it) more effective?

  3. Other than using InputStreamReader, what alternative methods of fixing stream encoding are there that might work better?

Upvotes: 0

Views: 174

Answers (1)

Joop Eggen

Reputation: 109613

With XML there are some minor complications: the <?xml ... ?> declaration on the first line specifies an encoding, and if it does not, the encoding defaults to UTF-8. Hence XML is often read as an InputStream (binary), leaving it to the XML parser to find out the encoding.
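A minimal sketch of that approach, assuming the standard JAXP DOM parser is acceptable for your documents. The class name ParseXmlAsBytes and the sample document are made up for illustration. The point is that the parser gets raw bytes and never sees a Reader with a pre-chosen encoding:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class ParseXmlAsBytes {

    // Hypothetical helper: hands the raw bytes to the parser, which reads
    // the <?xml ... ?> declaration (or a BOM) and picks the encoding itself.
    static Document parse(InputStream in) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        return factory.newDocumentBuilder().parse(in);
    }

    public static void main(String[] args) throws Exception {
        // Sample document declared and encoded as ISO-8859-1 (not UTF-8).
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><name>caf\u00e9</name>";
        byte[] bytes = xml.getBytes(StandardCharsets.ISO_8859_1);

        Document doc = parse(new ByteArrayInputStream(bytes));
        System.out.println("Declared encoding: " + doc.getXmlEncoding());
        System.out.println("Text content: " + doc.getDocumentElement().getTextContent());
    }
}
```

Wrapping the same bytes in new InputStreamReader(stream, "UTF-8") first would decode the 0xE9 byte incorrectly, because the reader's encoding overrides whatever the declaration says.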

When writing XML, one can usually assume it is held in a String. On writing, the encoding named in that <?xml ... ?> declaration should be the one passed to new OutputStreamWriter(outputStream, encoding).
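As a rough sketch of the writing side: the class WriteXmlWithDeclaredEncoding, its regular expression and the UTF-8 fallback are my own illustration, not a library API, and the pattern only covers a simple declaration at the very start of the String.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WriteXmlWithDeclaredEncoding {

    // Illustrative pattern: captures the encoding pseudo-attribute of a
    // declaration that starts the document.
    private static final Pattern ENCODING =
            Pattern.compile("^<\\?xml[^>]*?encoding\\s*=\\s*[\"']([^\"']+)[\"']");

    // Returns the declared encoding, or UTF-8 when none is declared.
    static String declaredEncoding(String xml) {
        Matcher m = ENCODING.matcher(xml);
        return m.find() ? m.group(1) : "UTF-8";
    }

    // Writes the XML using the encoding its own declaration names, so the
    // bytes on disk and the declaration cannot disagree.
    static void write(String xml, OutputStream out) throws IOException {
        Writer writer = new OutputStreamWriter(out, declaredEncoding(xml));
        writer.write(xml);
        writer.flush(); // the caller stays responsible for closing the stream
    }

    public static void main(String[] args) throws IOException {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><name>caf\u00e9</name>";
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        write(xml, out);
        System.out.println("chars: " + xml.length() + ", bytes: " + out.size());
    }
}
```

If the declared encoding is not supported by the JVM, the OutputStreamWriter constructor throws UnsupportedEncodingException, which is an IOException.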

The actual encoding of binary input and output XML files must be checked in a programmer's editor that handles encodings, such as JEdit or Notepad++.

If you want to read the text immediately in the right encoding: I did a search for XMLInputStreamReader and found some. But all your Reader class needs to do is buffer the first bytes in a ByteArrayOutputStream until the <?xml ... encoding=...?> declaration has been handled, and then create an InputStreamReader with that encoding.
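A sketch of such a reader, with some liberties taken: it uses a PushbackInputStream instead of a ByteArrayOutputStream to hold the first bytes, and the class name, the 256-byte limit and the regular expression are assumptions of mine. It also assumes an ASCII-compatible encoding without a byte order mark, so UTF-16 input would need extra handling.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class XmlDeclarationReader {

    // Assumption: the declaration fits within the first 256 bytes.
    private static final int PROLOG_LIMIT = 256;

    private static final Pattern ENCODING =
            Pattern.compile("<\\?xml[^>]*?encoding\\s*=\\s*[\"']([^\"']+)[\"']");

    // Peeks at the first bytes, extracts the declared encoding (UTF-8 if
    // absent), pushes the bytes back and returns a Reader in that encoding.
    static Reader open(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, PROLOG_LIMIT);
        byte[] head = new byte[PROLOG_LIMIT];
        int n = 0;
        int r;
        while (n < PROLOG_LIMIT
                && (r = pushback.read(head, n, PROLOG_LIMIT - n)) != -1) {
            n += r;
        }
        pushback.unread(head, 0, n); // nothing is lost for the real read

        // ISO-8859-1 maps every byte to one char, which is good enough
        // for inspecting an ASCII declaration.
        String prolog = new String(head, 0, n, StandardCharsets.ISO_8859_1);
        Matcher m = ENCODING.matcher(prolog);
        String encoding = m.find() ? m.group(1) : "UTF-8";
        return new InputStreamReader(pushback, encoding);
    }

    public static void main(String[] args) throws IOException {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><name>caf\u00e9</name>";
        byte[] bytes = xml.getBytes(StandardCharsets.ISO_8859_1);
        StringBuilder text = new StringBuilder();
        try (Reader reader = open(new ByteArrayInputStream(bytes))) {
            int c;
            while ((c = reader.read()) != -1) {
                text.append((char) c);
            }
        }
        System.out.println(text.toString().equals(xml));
    }
}
```

The Reader returned here still contains the original declaration, so if the text is later re-encoded, the declaration should be updated to match.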

Upvotes: 2
