Brian Glynn
Brian Glynn

Reputation: 13

Corrupt Gzip string due to character encoding

I have some corrupted Gzip log files that I'm trying to restore. The files were transfered to our servers through a Java backed web page. The files have always been sent as plain text, but we recently started to receive log files that were Gzipped. These Gzipped files appear to be corrupted, and are not unzip-able, and the originals have been deleted. I believe this is from the character encoding in the method below.

Is there any way to revert the process to restore the files to their original zipped format? I have the resulting Strings binary array data in a database blob.

Thanks for any help you can give!

 private String convertStreamToString(InputStream is) throws IOException {
    /*
     * To convert the InputStream to String we use the
     * Reader.read(char[] buffer) method. We iterate until the
     * Reader return -1 which means there's no more data to
     * read. We use the StringWriter class to produce the string.
     */
    if (is != null) {
        Writer writer = new StringWriter();

        char[] buffer = new char[1024];
        try {
            Reader reader = new BufferedReader(
                    new InputStreamReader(is, "UTF-8"));
            int n;
            while ((n = reader.read(buffer)) != -1) {
                writer.write(buffer, 0, n);
            }
        } finally {
            is.close();
        }
        return writer.toString();
    } else {
        return "";
    }
}

Upvotes: 1

Views: 4942

Answers (2)

MJB
MJB

Reputation: 9399

If that's the code you have you never should have converted to a Reader (or in fact a String). Only preserving as a Stream (or byte array) would avoid corrupting binary files. And once it's read into the string....illegal sequences (and there are many in utf-8) WILL be discarded.

So no, unless you are quite lucky, there is no way to recover the info. You'll have to provide another process where you process the pure stream and insert as a pure BLOB not a CLOB

Upvotes: 2

Joachim Sauer
Joachim Sauer

Reputation: 308269

If this is the method that was used to convert the InputStream to a String, then your data is almost certainly lost.

The problem is that UTF-8 has quite a few byte sequences that are simply not legal (i.e. they don't represent any value). These sequences will be replaced with the Unicode replacement character.

That character is the same no matter which invalid byte sequence was decoded. Therefore the specific information in those bytes is lost.

Upvotes: 2

Related Questions