rholdberh

Reputation: 557

How to zip files on s3 using lambda and python

I need to archive multiple files that exist on S3 and then upload the archive back to S3. I am trying to use Lambda and Python. As some of the files are larger than 500MB, downloading them into '/tmp' is not an option. Is there any way to stream the files one by one and put them into the archive?

Upvotes: 5

Views: 19321

Answers (4)

Sandip Wankhede

Reputation: 1

# For me, the code below worked in a Glue job: it takes a single .txt file from AWS S3, zips it, and uploads the archive back to AWS S3.
import boto3
import zipfile
from io import BytesIO
import logging

logger = logging.getLogger()

s3_client = boto3.client('s3')
s3_resource = boto3.resource('s3')


# ZipFileStream function definition
def _createZipFileStream(bucketName: str, bucketFilePath: str, bucketfileobject: str, zipKey: str) -> None:
    try:
        obj = s3_resource.Object(bucket_name=bucketName, key=bucketfileobject)
        archive = BytesIO()

        # Write the source object's bytes into an in-memory zip archive
        with zipfile.ZipFile(archive, 'w', zipfile.ZIP_DEFLATED) as zip_archive:
            with zip_archive.open(zipKey, 'w') as file1:
                file1.write(obj.get()['Body'].read())

        archive.seek(0)

        s3_client.upload_fileobj(archive, bucketName, bucketFilePath + '/' + zipKey + '.zip')
        archive.close()

        # If you would like to delete the .txt from AWS S3 after it is zipped, the call below will do it.
        _delete_object(bucket=bucketName, key=bucketfileobject)

    except Exception as e:
        logger.error(f"Failed to zip the txt file for {bucketName}/{bucketfileobject}: {e}")


# Delete-object-from-AWS-S3 function definition
def _delete_object(bucket: str, key: str) -> None:
    try:
        logger.info(f"Deleting: {bucket}/{key}")
        s3_client.delete_object(Bucket=bucket, Key=key)
    except Exception as e:
        logger.error(f"Failed to delete {bucket}/{key}: {e}")


# ZipFileStream function invocation
_createZipFileStream(
    bucketName="My_AWS_S3_bucket_name",
    bucketFilePath="My_txt_object_prefix",
    bucketfileobject="My_txt_Object_prefix + txt_file_name",
    zipKey="My_zip_file_prefix")

Upvotes: 0

macieks

Reputation: 481

AWS Lambda code: create a zip from the files with a given extension under bucket/filePath.


import boto3
import zipfile
from io import BytesIO

s3 = boto3.resource('s3')


def createZipFileStream(bucketName, bucketFilePath, jobKey, fileExt, createUrl=False):
    response = {}
    bucket = s3.Bucket(bucketName)
    filesCollection = bucket.objects.filter(Prefix=bucketFilePath).all()
    archive = BytesIO()

    # Add every object with the matching extension to an in-memory zip archive
    with zipfile.ZipFile(archive, 'w', zipfile.ZIP_DEFLATED) as zip_archive:
        for file in filesCollection:
            if file.key.endswith('.' + fileExt):
                with zip_archive.open(file.key, 'w') as file1:
                    file1.write(file.get()['Body'].read())

    archive.seek(0)
    s3.Object(bucketName, bucketFilePath + '/' + jobKey + '.zip').upload_fileobj(archive)
    archive.close()

    response['fileUrl'] = None

    # Optionally return a presigned URL for the new zip, valid for one hour
    if createUrl is True:
        s3Client = boto3.client('s3')
        response['fileUrl'] = s3Client.generate_presigned_url(
            'get_object',
            Params={'Bucket': bucketName, 'Key': bucketFilePath + '/' + jobKey + '.zip'},
            ExpiresIn=3600)

    return response
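For context, here is a minimal sketch of a Lambda handler invoking this function. The event field names (bucket, prefix, jobKey, ext) are assumptions, not part of the original answer; adapt them to your trigger's payload.

import json

def lambda_handler(event, context):
    # Event field names here are hypothetical; adapt them to your trigger's payload.
    result = createZipFileStream(
        bucketName=event['bucket'],
        bucketFilePath=event['prefix'],
        jobKey=event['jobKey'],
        fileExt=event['ext'],
        createUrl=True)
    return {'statusCode': 200, 'body': json.dumps(result)}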

Upvotes: 6

Anilkumar Kalyane

Reputation: 119

Do not write to disk; stream to and from S3.

Stream the zip file from the source bucket and, on the fly, read and write its contents back to another S3 bucket using Python.

This method does not use up disk space and is therefore not limited by the /tmp size (though the zip is buffered in memory, so it is still bounded by the Lambda function's memory allocation).

The basic steps are:

  • Read the zip file from S3 using the Boto3 S3 resource Object into a BytesIO buffer object
  • Open the object using the zipfile module
  • Iterate over each file in the zip file using the namelist method
  • Write the file back to another bucket in S3 using the resource meta.client.upload_fileobj method

The code (Python 3.6, using Boto3):

import boto3
import zipfile
from io import BytesIO

s3_resource = boto3.resource('s3')

# zip_key (the key of the source zip) and bucket (the destination bucket name)
# are assumed to be defined by the caller.
zip_obj = s3_resource.Object(bucket_name="bucket_name_here", key=zip_key)
buffer = BytesIO(zip_obj.get()["Body"].read())

# Iterate over each member of the zip and upload it back to S3
z = zipfile.ZipFile(buffer)
for filename in z.namelist():
    file_info = z.getinfo(filename)
    s3_resource.meta.client.upload_fileobj(
        z.open(filename),
        Bucket=bucket,
        Key=f'{filename}'
    )

Note: the AWS Lambda execution time limit has a maximum of 15 minutes, so can you process your HUGE files in that amount of time? You can only know by testing.

Upvotes: 7

John Rotenstein

Reputation: 269091

The /tmp/ directory is limited to 512MB for AWS Lambda functions.

If you search StackOverflow, you'll see some code from people who have created Zip files on-the-fly without saving files to disk. It becomes pretty complicated.

An alternative would be to attach an EFS filesystem to the Lambda function. It takes a bit of effort to set up, but the cost would be practically zero if you delete the files after use, and you'll have plenty of disk space, so your code will be more reliable and easier to maintain.
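For illustration, a minimal sketch of the EFS approach, assuming the filesystem is mounted at /mnt/efs; the mount path and the event fields (bucket, keys, output_key) are assumptions, not a prescribed API:

import os
import zipfile
import boto3

s3 = boto3.client('s3')
EFS_DIR = '/mnt/efs'  # assumed EFS mount path configured on the Lambda function


def lambda_handler(event, context):
    # Hypothetical event fields: source bucket, list of keys to archive, output key.
    bucket = event['bucket']
    zip_path = os.path.join(EFS_DIR, 'archive.zip')

    # Build the zip on EFS, downloading one object at a time to limit space usage
    with zipfile.ZipFile(zip_path, 'w', zipfile.ZIP_DEFLATED) as zf:
        for key in event['keys']:
            local = os.path.join(EFS_DIR, os.path.basename(key))
            s3.download_file(bucket, key, local)
            zf.write(local, arcname=key)
            os.remove(local)  # free space as we go

    s3.upload_file(zip_path, bucket, event['output_key'])
    os.remove(zip_path)  # delete after use to keep EFS storage costs near zero
    return {'zipKey': event['output_key']}

Because the archive is built on EFS rather than in /tmp or in memory, the 512MB limit no longer applies.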

Upvotes: 0
