Reputation: 6145
If I have a directory with ~5000 small files on S3, is there a way to easily zip up the entire directory and leave the resulting zip file on S3? I need to do this without having to manually access each file myself.
Upvotes: 29
Views: 82545
Reputation: 609
If you just want to download the S3 bucket's contents, without the zipping requirement, the AWS CLI will do it:
aws s3 sync s3://your-bucket-name /local/destination/folder
Upvotes: 1
Reputation: 29
The following just worked for me:
import io
import zipfile

import boto3


def ListDir(bucket_name, prefix, file_type='.pgp'):
    # file_type can be set to anything you need
    s3 = boto3.client('s3')
    files = []
    paginator = s3.get_paginator('list_objects_v2')
    pages = paginator.paginate(Bucket=bucket_name, Prefix=prefix)
    for page in pages:
        for obj in page['Contents']:
            files.append(obj['Key'])
    if files:
        files = [f for f in files if file_type in f]
    return files


def ZipFolder(bucket_name, prefix):
    files = ListDir(bucket_name, prefix)
    s3 = boto3.client("s3")
    zip_buffer = io.BytesIO()
    for ind, file in enumerate(files):
        file = file.split("/")[-1]
        print(f"Processing file {ind} : {file}")
        object_key = prefix + file
        print(object_key)
        # Re-opening in append mode adds each object to the same in-memory archive
        with zipfile.ZipFile(zip_buffer, "a", zipfile.ZIP_DEFLATED, False) as zipper:
            infile_object = s3.get_object(Bucket=bucket_name, Key=object_key)
            infile_content = infile_object['Body'].read()
            zipper.writestr(file, infile_content)
    # YOUR_ZIP_FILENAME is a placeholder for whatever you want the resulting zip to be called
    s3.put_object(Bucket=bucket_name, Key=prefix + YOUR_ZIP_FILENAME, Body=zip_buffer.getvalue())
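For reference, a minimal way to call it; the bucket name, prefix, and zip filename here are hypothetical placeholders, not values from the answer:

YOUR_ZIP_FILENAME = "archive.zip"       # hypothetical name for the output zip
ZipFolder("my-bucket", "some/folder/")  # prefix should end with "/" so prefix + file forms a valid key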
Upvotes: 1
Reputation: 2667
I agree with @BraveNewCurrency's answer.
You would need your own server to do this efficiently, because AWS S3 is really just key-value storage.
Command-line tools will not work here, as there are too many files to pass as arguments.
PAID OPTIONS
I am actually involved with a cheap commercial project that does just that.
They provide both an API and an option to launch your own pre-configured EC2 zipper server.
https://s3zipper.com/
https://docs.s3zipper.com
Large migrations (terabyte to petabyte scale):
AWS Snowball
FREE OPTIONS
You can also build your own server using the following free packages (JavaScript & Go (Golang)):
https://github.com/orangewise/s3-zip
https://github.com/DanielHindi/aws-s3-zipper
https://github.com/Teamwork/s3zipper
Upvotes: 14
Reputation: 13065
No, there is no magic bullet.
(As an aside, you have to realize that there is no such thing as a "directory" in S3. There are only objects with keys. You can get directory-like listings, but the '/' character isn't magic: you can delimit prefixes with any character you want.)
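To illustrate, here is a minimal boto3 sketch of a delimiter-based, directory-style listing; the bucket name and prefix are hypothetical:

import boto3

s3 = boto3.client("s3")
# Group keys under "photos/" by the next "/"; any other delimiter character behaves the same way.
resp = s3.list_objects_v2(Bucket="my-bucket", Prefix="photos/", Delimiter="/")
for cp in resp.get("CommonPrefixes", []):
    print(cp["Prefix"])  # the directory-like groupings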
As someone pointed out, "pre-zipping" them can help both download speed and append speed. (At the expense of duplicate storage.)
If downloading is the bottleneck, it sounds like you are downloading serially. S3 can support thousands of simultaneous connections to the same object without breaking a sweat. You'll need to run benchmarks to see how many connections work best, since too many connections from one box might get throttled by S3. And you may need to do some TCP tuning when making thousands of connections per second.
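For illustration, a minimal sketch of parallel GETs with boto3 and a thread pool; the bucket name, keys, and worker count are placeholders you would benchmark yourself, not recommendations:

import boto3
from concurrent.futures import ThreadPoolExecutor

s3 = boto3.client("s3")  # a single client can be shared across threads
keys = ["myprefix/file-%04d" % i for i in range(5000)]  # hypothetical keys

def fetch(key):
    # Each call is an independent GET request that S3 can serve concurrently.
    return key, s3.get_object(Bucket="my-bucket", Key=key)["Body"].read()

with ThreadPoolExecutor(max_workers=64) as pool:
    for key, body in pool.map(fetch, keys):
        pass  # write to disk, feed a zip writer, etc.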
The "solution" depends heavily on your data access patterns. Try re-arranging the problem. If your single-file downloads are infrequent, it might make more sense to group them 100 at a time into S3, then break them apart when requested. If they are small files, it might make sense to cache them on the filesystem.
Or it might make sense to store all 5000 files as one big zip file in S3, and use a "smart client" that can download specific ranges of the zip file in order to serve the individual files. (S3 supports byte ranges, as I recall.)
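S3 does indeed support byte-range reads. As a minimal sketch (bucket, key, and offsets are hypothetical; a real "smart client" would first read the zip's central directory to locate each member):

import boto3

s3 = boto3.client("s3")
# Fetch only bytes 1000-1999 of the archive instead of downloading the whole object.
resp = s3.get_object(
    Bucket="my-bucket",
    Key="all-files.zip",
    Range="bytes=1000-1999",
)
chunk = resp["Body"].read()  # just the requested slice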
Upvotes: 18