Freid001

Reputation: 2850

Unable to decompress gzipped files after uploading input stream chunks to S3

I'd like to take my input stream and upload gzipped parts to S3 in a similar fashion to the multipart uploader. However, I want to store the individual file parts in S3 rather than combine them into a single file.

To do so, I have created the methods below. However, when I try to decompress each part, gzip throws an error: gzip: file_part_2.log.gz: not in gzip format.

I'm not sure whether I am compressing each part correctly.

If I re-initialise the GZIPOutputStream (gzip = new GZIPOutputStream(baos);) and call gzip.finish() after resetting the ByteArrayOutputStream (baos.reset();), then I am able to decompress each part. I'm not sure why I need to do this; is there a similar reset for the GZIPOutputStream?

public void upload(String bucket, String key, InputStream is, int partSize) throws Exception
{
    String row;
    BufferedReader br = new BufferedReader(new InputStreamReader(is, ENCODING));
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    GZIPOutputStream gzip = new GZIPOutputStream(baos);

    int partCounter = 0;
    int lineCounter = 0;
    while ((row = br.readLine()) != null) {
        if (baos.size() >= partSize) {
            partCounter = this.uploadChunk(bucket, key, baos, partCounter);

            baos.reset();
        }else if(!row.equals("")){
            row += '\n';
            gzip.write(row.getBytes(ENCODING));
            lineCounter++;
        }
    }

    gzip.finish();
    br.close();
    baos.close();

    if(lineCounter == 0){
        throw new Exception("Aborting upload, file contents is empty!");
    }

    //Final chunk
    if (baos.size() > 0) {
        this.uploadChunk(bucket, key, baos, partCounter);
    }
}

private int uploadChunk(String bucket, String key, ByteArrayOutputStream baos, int partCounter)
{
    ObjectMetadata metaData = new ObjectMetadata();
    metaData.setContentLength(baos.size());

    String[] path = key.split("/");
    String[] filename = path[path.length-1].split("\\.");

    filename[0] = filename[0]+"_part_"+partCounter;

    path[path.length-1] = String.join(".", filename);

    amazonS3.putObject(
            bucket,
            String.join("/", path),
            new ByteArrayInputStream(baos.toByteArray()),
            metaData
    );

    log.info("Upload chunk {}, size: {}", partCounter, baos.size());

    return partCounter+1;
}

Upvotes: 0

Views: 739

Answers (1)

guest

Reputation: 26

The problem is that you're using a single GZIPOutputStream for all chunks. So you're actually writing pieces of one gzipped file, which would have to be recombined to be useful.

Making the minimal change to your existing code:

if (baos.size() >= partSize) {
    gzip.close();                           // finish the current gzip stream
    partCounter = this.uploadChunk(bucket, key, baos, partCounter);
    baos = new ByteArrayOutputStream();     // fresh buffer for the next chunk
    gzip = new GZIPOutputStream(baos);      // fresh gzip stream for the next chunk
}

You need to do the same at the end of the loop. Also, you shouldn't throw an exception if the line counter is 0: it's entirely possible that the file is exactly divisible into a set number of chunks.
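
Putting that together, the whole method could look roughly like this. This is a sketch, not a drop-in replacement: it assumes your existing uploadChunk helper and ENCODING constant, writes each row before checking the buffer size (so the row that triggers an upload isn't skipped), and tracks whether any data is pending so an empty trailing gzip stream isn't uploaded.

public void upload(String bucket, String key, InputStream is, int partSize) throws Exception
{
    BufferedReader br = new BufferedReader(new InputStreamReader(is, ENCODING));
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    GZIPOutputStream gzip = new GZIPOutputStream(baos);

    int partCounter = 0;
    boolean pending = false;    // true while the current gzip stream holds data not yet uploaded
    String row;

    while ((row = br.readLine()) != null) {
        if (row.isEmpty()) {
            continue;
        }
        gzip.write((row + "\n").getBytes(ENCODING));
        pending = true;

        if (baos.size() >= partSize) {
            gzip.close();                               // finish this chunk's gzip stream
            partCounter = this.uploadChunk(bucket, key, baos, partCounter);
            baos = new ByteArrayOutputStream();         // fresh buffer for the next chunk
            gzip = new GZIPOutputStream(baos);          // fresh gzip stream for the next chunk
            pending = false;
        }
    }
    br.close();

    gzip.close();                                       // finish the final (possibly partial) chunk
    if (pending) {
        this.uploadChunk(bucket, key, baos, partCounter);
    }
}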

To improve the code, I would wrap the GZIPOutputStream in an OutputStreamWriter and a BufferedWriter, so that you don't need to do the string-bytes conversion explicitly.
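
For example, something along these lines (a sketch; ENCODING is the charset constant from the question, and closing the writer also finishes the underlying gzip stream):

ByteArrayOutputStream baos = new ByteArrayOutputStream();
GZIPOutputStream gzip = new GZIPOutputStream(baos);
BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(gzip, ENCODING));

writer.write(row);      // no explicit getBytes() call needed
writer.newLine();
// ...
writer.close();         // flushes the writer and finishes the gzip stream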

And lastly, don't use ByteArrayOutputStream.reset(). It doesn't save you anything over just creating a new stream, and opens the door for errors if you ever forget to reset.

Upvotes: 1
