Alfe

Reputation: 59426

How to gzip while uploading into s3 using boto

I have a large local file. I want to upload a gzipped version of that file into S3 using the boto library. The file is too large to gzip it efficiently on disk prior to uploading, so it should be gzipped in a streamed way during the upload.

The boto library provides the method set_contents_from_file(), which expects a file-like object that it will read from.

The gzip library provides the class GzipFile, which can be given a file-like object via the fileobj parameter; it will write into that object while compressing.

I'd like to combine these two, but one API wants to do the reading itself and the other wants to do the writing itself; neither offers a passive counterpart (being written to or being read from).
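
To make the mismatch concrete, here is a sketch of the naive approach that works for small files but not for mine, since the whole compressed file sits in memory before the upload starts (bucket and file names are just placeholders):

import gzip
import shutil
from io import BytesIO

import boto

def upload_gz_in_memory(bucket, key_name, file_name):
    buf = BytesIO()
    # gzip wants to *write* into buf ...
    with open(file_name, 'rb') as src, gzip.GzipFile(fileobj=buf, mode='w') as gz:
        shutil.copyfileobj(src, gz)
    buf.seek(0)
    # ... and boto wants to *read* from buf, but only after everything is compressed.
    key = bucket.new_key(key_name)
    key.set_contents_from_file(buf)

# conn = boto.connect_s3()
# upload_gz_in_memory(conn.get_bucket('my-bucket'), 'data.csv.gz', 'data.csv')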

Does anybody have an idea on how to combine these in a working fashion?

EDIT: I accepted one answer (see below) because it pointed me in the right direction, but if you have the same problem, you might find my own answer (also below) more helpful, because I implemented a solution using multipart uploads in it.

Upvotes: 22

Views: 24774

Answers (4)

Steven Trojanowski

Reputation: 11

The solution presented by Alfe was not working for me, so I modified it and got it working. I was able to use it to transfer large files under significant memory constraints.

import gzip
from io import BytesIO

import boto3

s3 = boto3.client(
    's3',
    aws_access_key_id=<AWS_ACCESS_KEY_ID>,
    aws_secret_access_key=<AWS_SECRET_ACCESS_KEY>)

def upload_multipart_file_gz(s3, bucket, key, fileName, suffix='.gz'):
    key += suffix
    chunks = []
    response = s3.create_multipart_upload(
        Bucket=bucket,
        Key=key)
    upload_id = response['UploadId']
    stream = BytesIO()
    compressor = gzip.GzipFile(fileobj=stream, mode='w')

    def uploadPart(upload_id, partCount=[0]):
        partCount[0] += 1
        stream.seek(0)
        response = s3.upload_part(Bucket=bucket,
                                  Key=key,
                                  Body=stream,
                                  PartNumber=partCount[0],
                                  UploadId=upload_id)
        stream.seek(0)
        stream.truncate()
        chunk_data = {'ETag': response['ETag'], 'PartNumber': partCount[0]}
        return chunk_data

    with open(fileName, "rb") as inputFile:
        while True:  # until EOF
            chunk = inputFile.read(8192)
            if not chunk:  # EOF: flush the last part and finish the upload
                compressor.close()
                chunk_data = uploadPart(upload_id)
                chunks.append(chunk_data)
                parts_dict = {'Parts': chunks}
                s3.complete_multipart_upload(Bucket=bucket,
                                             Key=key,
                                             MultipartUpload=parts_dict,
                                             UploadId=upload_id)
                break
            compressor.write(chunk)
            if stream.tell() > 10 << 20:  # minimum part size for multipart upload is 5 MB (5242880 bytes)
                chunk_data = uploadPart(upload_id)
                chunks.append(chunk_data)

There are a few places for improvement, but it works.
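
For example, it could be called like this (the bucket, key and file names are just placeholders):

upload_multipart_file_gz(s3, 'my-bucket', 'exports/data.csv', '/tmp/data.csv')
# the object ends up under the key 'exports/data.csv.gz'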

Upvotes: 1

Rene B.

Reputation: 7364

You can also compress bytes with gzip and upload them to S3 as follows:

import gzip
import boto3

cred = boto3.Session().get_credentials()

s3client = boto3.client('s3',
                            aws_access_key_id=cred.access_key,
                            aws_secret_access_key=cred.secret_key,
                            aws_session_token=cred.token
                            )

bucketname = 'my-bucket-name'      
key = 'filename.gz'  

s_in = b"Lots of content here"
gzip_object = gzip.compress(s_in)

s3client.put_object(Bucket=bucketname, Body=gzip_object, Key=key)

You can replace s_in with any bytes object, for example the output of pickle.dumps() or the contents of a file read in binary mode (gzip.compress() expects bytes).
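
For instance (bucket, key and file names are just placeholders):

import gzip
import pickle

import boto3

s3client = boto3.client('s3')

# bytes read from a local file that is small enough to fit in memory
with open('report.csv', 'rb') as f:
    s3client.put_object(Bucket='my-bucket-name',
                        Key='report.csv.gz',
                        Body=gzip.compress(f.read()))

# bytes produced by pickle.dumps()
payload = {'rows': [1, 2, 3]}
s3client.put_object(Bucket='my-bucket-name',
                    Key='payload.pkl.gz',
                    Body=gzip.compress(pickle.dumps(payload)))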

If you want to upload compressed JSON, then here is a nice example: Upload compressed Json to S3

Upvotes: 10

Alfe

Reputation: 59426

I implemented the solution hinted at in the comments of the accepted answer by garnaat:

import cStringIO
import gzip

def sendFileGz(bucket, key, fileName, suffix='.gz'):
    key += suffix
    mpu = bucket.initiate_multipart_upload(key)
    stream = cStringIO.StringIO()
    compressor = gzip.GzipFile(fileobj=stream, mode='w')

    def uploadPart(partCount=[0]):
        partCount[0] += 1
        stream.seek(0)
        mpu.upload_part_from_file(stream, partCount[0])
        stream.seek(0)
        stream.truncate()

    with file(fileName) as inputFile:
        while True:  # until EOF
            chunk = inputFile.read(8192)
            if not chunk:  # EOF?
                compressor.close()
                uploadPart()
                mpu.complete_upload()
                break
            compressor.write(chunk)
            if stream.tell() > 10<<20:  # min size for multipart upload is 5242880
                uploadPart()

It seems to work without problems. And after all, streaming is in most cases just chunking the data. In this case the chunks are about 10 MB large, but who cares? As long as we aren't talking about several-GB chunks, I'm fine with this.


Update for Python 3:

from io import BytesIO
import gzip

def sendFileGz(bucket, key, fileName, suffix='.gz'):
    key += suffix
    mpu = bucket.initiate_multipart_upload(key)
    stream = BytesIO()
    compressor = gzip.GzipFile(fileobj=stream, mode='w')

    def uploadPart(partCount=[0]):
        partCount[0] += 1
        stream.seek(0)
        mpu.upload_part_from_file(stream, partCount[0])
        stream.seek(0)
        stream.truncate()

    with open(fileName, "rb") as inputFile:
        while True:  # until EOF
            chunk = inputFile.read(8192)
            if not chunk:  # EOF?
                compressor.close()
                uploadPart()
                mpu.complete_upload()
                break
            compressor.write(chunk)
            if stream.tell() > 10<<20:  # min size for multipart upload is 5242880
                uploadPart()
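
Either variant could be called like this (boto 2 style; the bucket and file names are just placeholders):

import boto

conn = boto.connect_s3()
bucket = conn.get_bucket('my-bucket')
sendFileGz(bucket, 'exports/data.csv', '/tmp/data.csv')  # uploads to the key 'exports/data.csv.gz'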

Upvotes: 29

garnaat

Reputation: 45856

There really isn't a way to do this because S3 doesn't support true streaming input (i.e. chunked transfer encoding). You must know the Content-Length prior to upload and the only way to know that is to have performed the gzip operation first.
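
In practice that means something like the following sketch: compress into a temporary file first so the Content-Length is known, then upload it (boto 2 style; function and file names are just placeholders):

import gzip
import shutil
import tempfile

def upload_gz_via_tempfile(bucket, key_name, file_name):
    with tempfile.NamedTemporaryFile(suffix='.gz') as tmp:
        with open(file_name, 'rb') as src, gzip.open(tmp.name, 'wb') as gz:
            shutil.copyfileobj(src, gz)
        # only now is the compressed size (Content-Length) known
        key = bucket.new_key(key_name)
        key.set_contents_from_filename(tmp.name)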

Upvotes: 7
