Reputation: 10938
I have a lot of files on S3 that I need to zip and then provide as a zip via S3. Currently I zip them from a stream into a local file and then upload that file again. This takes up a lot of disk space: each file is around 3-10 MB, and I have to zip up to 100,000 files, so a single zip can be larger than 1 TB. I would therefore like a solution along these lines:
Create a zip file on S3 from files on S3 using Lambda Node
Here it seems the zip is created directly on S3 without taking up local disk space. But I am just not smart enough to transfer the above solution to Java. I am also finding conflicting information on the Java AWS SDK, saying that they planned on changing the stream behavior in 2017.
Not sure if this will help, but here's what I've been doing so far (Upload is my local model that holds the S3 information). I just removed logging and such for better readability. I think I am not taking up disk space for the download, since I pipe the InputStream directly into the zip. But like I said, I would also like to avoid the local zip file and create it directly on S3. That, however, would probably require the ZipOutputStream to be created with S3 as the target instead of a FileOutputStream. Not sure how that can be done.
public File zipUploadsToNewTemp(List<Upload> uploads) {
    byte[] buffer = new byte[1024];
    File tempZipFile;
    try {
        tempZipFile = File.createTempFile(UUID.randomUUID().toString(), ".zip");
    } catch (Exception e) {
        throw new ApiException(e, BaseErrorCode.FILE_ERROR, "Could not create Zip file");
    }
    try (FileOutputStream fileOutputStream = new FileOutputStream(tempZipFile);
            ZipOutputStream zipOutputStream = new ZipOutputStream(fileOutputStream)) {
        for (Upload upload : uploads) {
            // try-with-resources closes the S3 stream even if writing the entry fails
            try (InputStream inputStream = getStreamFromS3(upload)) {
                zipOutputStream.putNextEntry(new ZipEntry(upload.getFileName()));
                writeStreamToZip(buffer, zipOutputStream, inputStream);
                zipOutputStream.closeEntry();
            }
        }
        return tempZipFile;
    } catch (IOException e) {
        logError(type, e);
        if (tempZipFile.exists()) {
            FileUtils.delete(tempZipFile);
        }
        throw new ApiException(e, BaseErrorCode.IO_ERROR,
                "Error zipping files: " + e.getMessage());
    }
}
// I am not even sure, but I think this takes up memory rather than disk space
private InputStream getStreamFromS3(Upload upload) {
    try {
        String filename = upload.getId() + "." + upload.getFileType();
        return s3FileService
                .getObject(upload.getBucketName(), filename, upload.getPath());
    } catch (ApiException e) {
        throw e;
    } catch (Exception e) {
        logError(type, e);
        throw new ApiException(e, BaseErrorCode.UNKOWN_ERROR,
                "Unknown error communicating with S3 for file: " + upload.getFileName());
    }
}
private void writeStreamToZip(byte[] buffer, ZipOutputStream zipOutputStream,
        InputStream inputStream) {
    try {
        int len;
        while ((len = inputStream.read(buffer)) > 0) {
            zipOutputStream.write(buffer, 0, len);
        }
    } catch (IOException e) {
        throw new ApiException(e, BaseErrorCode.IO_ERROR, "Could not write stream to zip");
    }
}
And finally the upload source code. The InputStream is created from the temp zip file.
public PutObjectResult upload(InputStream inputStream, String bucketName, String filename, String folder) {
    String uploadKey = StringUtils.isEmpty(folder) ? "" : (folder + "/");
    uploadKey += filename;
    ObjectMetadata metaData = new ObjectMetadata();
    byte[] bytes;
    try {
        // note: this buffers the entire stream in memory just to learn the content length
        bytes = IOUtils.toByteArray(inputStream);
    } catch (IOException e) {
        throw new ApiException(e, BaseErrorCode.IO_ERROR, e.getMessage());
    }
    metaData.setContentLength(bytes.length);
    ByteArrayInputStream byteArrayInputStream = new ByteArrayInputStream(bytes);
    PutObjectRequest putObjectRequest =
            new PutObjectRequest(bucketPrefix + bucketName, uploadKey, byteArrayInputStream, metaData);
    putObjectRequest.setCannedAcl(CannedAccessControlList.PublicRead);
    try {
        return getS3Client().putObject(putObjectRequest);
    } catch (SdkClientException se) {
        throw s3Exception(se);
    } finally {
        IOUtils.closeQuietly(inputStream);
    }
}
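A side note on the upload above: IOUtils.toByteArray buffers the entire zip in memory just to determine the content length, which cannot work for archives approaching 1 TB. Since the source here is a temp file anyway, the file-based PutObjectRequest constructor lets the SDK stream from disk and derive the length itself. A minimal sketch, reusing the same (hypothetical) fields and helpers from the method above:
public PutObjectResult uploadFile(File file, String bucketName, String filename, String folder) {
    String uploadKey = StringUtils.isEmpty(folder) ? "" : (folder + "/");
    uploadKey += filename;
    // The file-based constructor streams from disk; the SDK sets the content length itself
    PutObjectRequest putObjectRequest =
            new PutObjectRequest(bucketPrefix + bucketName, uploadKey, file);
    putObjectRequest.setCannedAcl(CannedAccessControlList.PublicRead);
    try {
        return getS3Client().putObject(putObjectRequest);
    } catch (SdkClientException se) {
        throw s3Exception(se);
    }
}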
Just found a similar question to what I need, also without an answer:
Upload ZipOutputStream to S3 without saving zip file (large) temporary to disk using AWS S3 Java
Upvotes: 5
Views: 13954
Reputation: 21
You can get an input stream from your S3 data, zip the bytes on the fly, and stream them back to S3:
long numBytes; // length of data to send in bytes... somehow you know it before processing the entire stream
PipedOutputStream os = new PipedOutputStream();
PipedInputStream is = new PipedInputStream(os);
ObjectMetadata meta = new ObjectMetadata();
meta.setContentLength(numBytes);
new Thread(() -> {
    /* Write to os here; make sure to close it when you're done */
    try (ZipOutputStream zipOutputStream = new ZipOutputStream(os)) {
        ZipEntry zipEntry = new ZipEntry("myKey");
        zipOutputStream.putNextEntry(zipEntry);
        S3ObjectInputStream objectContent =
                amazonS3Client.getObject("myBucket", "myKey").getObjectContent();
        byte[] bytes = new byte[1024];
        int length;
        while ((length = objectContent.read(bytes)) >= 0) {
            zipOutputStream.write(bytes, 0, length);
        }
        objectContent.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}).start();
amazonS3Client.putObject("myBucket", "myKey", is, meta);
is.close(); // always close your streams
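This zips a single object back onto its own key. For the question's scenario (many objects into one archive whose total size is unknown up front), the same piped-stream idea extends to a loop over the source keys plus a multipart upload, so the finished zip's length never needs to be known in advance. A sketch, not production code: amazonS3Client, bucket, sourceKeys, and targetKey are assumed names, and writer-thread error handling is elided.
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.*;
import java.io.*;
import java.util.*;
import java.util.zip.*;

class S3ZipSketch {
    void zipObjectsToS3(AmazonS3 amazonS3Client, String bucket,
                        List<String> sourceKeys, String targetKey) throws IOException {
        PipedOutputStream os = new PipedOutputStream();
        PipedInputStream is = new PipedInputStream(os, 1024 * 1024);
        new Thread(() -> {
            try (ZipOutputStream zip = new ZipOutputStream(os)) {
                for (String key : sourceKeys) {
                    zip.putNextEntry(new ZipEntry(key));
                    try (InputStream in = amazonS3Client.getObject(bucket, key).getObjectContent()) {
                        byte[] buf = new byte[8192];
                        int len;
                        while ((len = in.read(buf)) >= 0) {
                            zip.write(buf, 0, len);
                        }
                    }
                    zip.closeEntry();
                }
            } catch (IOException e) {
                e.printStackTrace(); // a real implementation should abort the upload here
            }
        }).start();
        // Multipart upload: read fixed-size parts off the pipe, so the total length is never needed
        InitiateMultipartUploadResult init = amazonS3Client.initiateMultipartUpload(
                new InitiateMultipartUploadRequest(bucket, targetKey));
        List<PartETag> parts = new ArrayList<>();
        byte[] part = new byte[100 * 1024 * 1024]; // parts must be >= 5 MB, except the last
        int partNumber = 1;
        int filled;
        while ((filled = readFully(is, part)) > 0) {
            parts.add(amazonS3Client.uploadPart(new UploadPartRequest()
                    .withBucketName(bucket).withKey(targetKey)
                    .withUploadId(init.getUploadId()).withPartNumber(partNumber++)
                    .withInputStream(new ByteArrayInputStream(part, 0, filled))
                    .withPartSize(filled)).getPartETag());
        }
        amazonS3Client.completeMultipartUpload(new CompleteMultipartUploadRequest(
                bucket, targetKey, init.getUploadId(), parts));
        is.close();
    }

    // Fill buf as far as the stream allows; returns bytes read, 0 at end of stream
    private static int readFully(InputStream in, byte[] buf) throws IOException {
        int off = 0;
        while (off < buf.length) {
            int n = in.read(buf, off, buf.length - off);
            if (n < 0) break;
            off += n;
        }
        return off;
    }
}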
Upvotes: 2
Reputation: 270224
I would suggest using an Amazon EC2 instance (as low as 1¢/hour, or you could even use a Spot Instance to get it at a lower price). Smaller instance types cost less but have limited bandwidth, so play around with the size to get your preferred performance.
Write a script that loops through the files, downloads each one, zips them, and uploads the resulting archive (a rough Java version is sketched below). All the zip magic happens on local disk. No need to use streams. Just use the Amazon S3 download_file() and upload_file() calls.
If the EC2 instance is in the same region as Amazon S3 then there is no Data Transfer charge.
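download_file() and upload_file() are the boto3 (Python) calls; in the asker's Java world the closest equivalents are the file-based getObject and putObject overloads. A rough sketch of the loop, with s3, keys, and the bucket/key names as assumed placeholders:
File zipFile = File.createTempFile("archive", ".zip");
try (ZipOutputStream zip = new ZipOutputStream(new FileOutputStream(zipFile))) {
    for (String key : keys) {
        File tmp = File.createTempFile("part", null);
        // download_file() equivalent: write the object straight to local disk
        s3.getObject(new GetObjectRequest("myBucket", key), tmp);
        zip.putNextEntry(new ZipEntry(key));
        Files.copy(tmp.toPath(), zip); // java.nio.file.Files
        zip.closeEntry();
        tmp.delete(); // keep only one source file on disk at a time
    }
}
// upload_file() equivalent: the SDK streams the file and knows its length
s3.putObject("myBucket", "archive.zip", zipFile);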
Upvotes: 0