x89

Reputation: 3450

How to gzip files in the /tmp folder

Using an AWS Lambda function, I download an S3 zipped file and unzip it.

For now I do it using extractall. Upon unzipping, all files are saved in the /tmp/ folder.

s3.download_file('test','10000838.zip','/tmp/10000838.zip')

with zipfile.ZipFile('/tmp/10000838.zip', 'r') as zip_ref:
    lstNEW = list(filter(lambda x: not x.startswith("__MACOSX/"), zip_ref.namelist()))
    zip_ref.extractall('/tmp/', members=lstNEW)

After unzipping, I want to gzip files and place them in another S3 bucket.

Now, how can I read all the files back from the /tmp folder and gzip each one (as $item.csv.gz)?

I have looked at the gzip docs (https://docs.python.org/3/library/gzip.html) but I am not sure which function to use.

If it's the compress function, how exactly do I use it? I read in this answer (gzip a file in Python) that I can use gzip.open('', 'wb') to gzip a file, but I couldn't figure out how to apply it in my case. In the open call, do I pass the target location or the source location? And where do I save the gzipped files so that I can later upload them to S3?
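This is my attempt so far based on that answer (untested; /tmp/item.csv stands in for one of the extracted files, and I'm assuming the path passed to gzip.open is the destination .gz file):

import gzip
import shutil

# /tmp/item.csv is a placeholder for one extracted file
with open('/tmp/item.csv', 'rb') as src, gzip.open('/tmp/item.csv.gz', 'wb') as dst:
    shutil.copyfileobj(src, dst)  # copy the plain file into the gzip-compressed target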

Alternative Option:

Instead of extracting everything to the /tmp folder, I read that I can also open an output stream, wrap it in a gzip wrapper, and then copy from one stream to the other:

with zipfile.ZipFile('/tmp/10000838.zip', 'r') as zip_ref:
    testList = []
    for i in zip_ref.namelist():
        if not i.startswith("__MACOSX/"):
            testList.append(i)
    for i in testList:
        member = zip_ref.open(i, 'r')  # a readable file-like stream for this entry

but I am not sure how to continue inside the for loop: how do I wrap each member stream in gzip and get the result into the other S3 bucket?
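What I imagine the streaming version could look like (a rough, untested sketch; boto3, the target bucket name, and the "name.gz" key pattern are my own assumptions). Each zip member is gzipped into an in-memory buffer and uploaded, so nothing extra is written to /tmp:

import gzip
import io
import shutil
import zipfile

import boto3

s3 = boto3.client("s3")
TARGET_BUCKET = "my-target-bucket"  # placeholder: the bucket the .gz files should go to

with zipfile.ZipFile('/tmp/10000838.zip', 'r') as zip_ref:
    members = [m for m in zip_ref.namelist() if not m.startswith("__MACOSX/")]
    for name in members:
        buffer = io.BytesIO()
        # gzip-wrap an in-memory buffer and stream the zip member into it
        with zip_ref.open(name, 'r') as member, gzip.GzipFile(fileobj=buffer, mode="wb") as gz:
            shutil.copyfileobj(member, gz)
        buffer.seek(0)
        s3.upload_fileobj(buffer, TARGET_BUCKET, f"{name}.gz")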

Upvotes: 0

Views: 726

Answers (1)

JonSG

Reputation: 13067

Depending on the sizes of the files, I would skip writing the .gz file(s) to disk. Perhaps something based on s3fs (or boto3) and gzip.

import contextlib
import gzip

import s3fs

AWS_S3 = s3fs.S3FileSystem(anon=False)  # AWS env must be set up correctly

source_file_path = "/tmp/your_file.txt"
s3_file_path = "my-bucket/your_file.txt.gz"

with contextlib.ExitStack() as stack:
    # local source, S3 destination, and a gzip writer layered on top of the destination
    source_file = stack.enter_context(open(source_file_path, mode="rb"))
    destination_file = stack.enter_context(AWS_S3.open(s3_file_path, mode="wb"))
    destination_file_gz = stack.enter_context(gzip.GzipFile(fileobj=destination_file, mode="wb"))
    # stream the source in chunks so large files never sit fully in memory
    while True:
        chunk = source_file.read(1024)
        if not chunk:
            break
        destination_file_gz.write(chunk)

Note: I have not tested this, so if it does not work, let me know.
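To run that for every extracted file, something like this should work (again untested; the *.csv glob and the my-bucket prefix are assumptions about your layout):

import contextlib
import gzip
import pathlib

import s3fs

AWS_S3 = s3fs.S3FileSystem(anon=False)

def gzip_to_s3(source_file_path, s3_file_path):
    # same pattern as above, wrapped in a function so it can be reused per file
    with contextlib.ExitStack() as stack:
        source_file = stack.enter_context(open(source_file_path, mode="rb"))
        destination_file = stack.enter_context(AWS_S3.open(s3_file_path, mode="wb"))
        destination_file_gz = stack.enter_context(gzip.GzipFile(fileobj=destination_file, mode="wb"))
        while True:
            chunk = source_file.read(1024)
            if not chunk:
                break
            destination_file_gz.write(chunk)

for path in pathlib.Path("/tmp").glob("*.csv"):
    gzip_to_s3(path, f"my-bucket/{path.name}.gz")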

Upvotes: 1
