Rahul Gupta-Iwasaki

Reputation: 1743

Streaming decompression of zip archives in Python

Is there a way to do streaming decompression of single-file zip archives?

I currently have arbitrarily large zipped archives (single file per archive) in s3. I would like to be able to process the files by iterating over them without having to actually download the files to disk or into memory.

A simple example:

import boto

def count_newlines(bucket_name, key_name):
    conn = boto.connect_s3()
    b = conn.get_bucket(bucket_name)
    # key is a .zip file
    key = b.get_key(key_name)

    count = 0
    for chunk in key:
        # How should decompress happen?
        count += decompress(chunk).count('\n')

    return count

A related answer demonstrates a method of doing the same thing with gzip'd files. Unfortunately, I haven't been able to get the same technique to work using the zipfile module, as it seems to require random access to the entire file being unzipped.

Upvotes: 5

Views: 5931

Answers (5)

Michal Charemza

Reputation: 27012

While I suspect it's not possible with absolutely all zip files, I also suspect almost(?) all modern zip files are streaming-compatible, and it is possible to do streaming decompression, for example using https://github.com/uktrade/stream-unzip [full disclosure: originally written by me]

The example from its README shows how to do this with an arbitrary HTTP request using httpx:

from stream_unzip import stream_unzip
import httpx

def zipped_chunks():
    # Any iterable that yields a zip file
    with httpx.stream('GET', 'https://www.example.com/my.zip') as r:
        yield from r.iter_bytes(chunk_size=65536)

for file_name, file_size, unzipped_chunks in stream_unzip(zipped_chunks()):
    for chunk in unzipped_chunks:
        print(chunk)

but I think it could be adapted for boto3 to stream unzip/decompress from S3 (untested):

from stream_unzip import stream_unzip
import boto3

def zipped_chunks():
    yield from boto3.client('s3', region_name='us-east-1').get_object(
        Bucket='my-bucket-name',
        Key='the/key/of/the.zip'
    )['Body'].iter_chunks()

for file_name, file_size, unzipped_chunks in stream_unzip(zipped_chunks()):
    for chunk in unzipped_chunks:
        print(chunk)

Upvotes: 3

doki_pen

Reputation: 106

You can use https://pypi.python.org/pypi/tubing; it even has built-in S3 source support using boto3.

from tubing.ext import s3
from tubing import pipes, sinks
output = s3.S3Source(bucket, key) \
    | pipes.Gunzip() \
    | pipes.Split(on=b'\n') \
    | sinks.Objects()
print(len(output))

If you didn't want to store the entire output in the returned sink, you could make your own sink that just counts. The implementation would look like:

class CountWriter(object):
    def __init__(self):
        self.count = 0
    def write(self, chunk):
        self.count += len(chunk)
Counter = sinks.MakeSink(CountWriter)

Upvotes: 1

Ramin Halavati

Reputation: 107

You can do it in Python 3.4.3 using ZipFile as follows:

from zipfile import ZipFile

with ZipFile('spam.zip') as myzip:
    with myzip.open('eggs.txt') as myfile:
        print(myfile.read())

Python Docs

Upvotes: -3

Mark Adler

Reputation: 112349

Yes, but you'll likely have to write your own code to do it if it has to be in Python. You can look at sunzip for an example in C of how to unzip a zip file from a stream. sunzip creates temporary files as it decompresses the zip entries, then moves those files and sets their attributes appropriately upon reading the central directory at the end. Claims that you must be able to seek to the central directory in order to properly unzip a zip file are incorrect.

Upvotes: 2

rezca

Reputation: 116

The zip central directory, which indexes the archive's contents, is at the end of the file, which is why reading a zip normally needs random access. See https://en.wikipedia.org/wiki/Zip_(file_format)#Structure.

You could parse the local file header which should be at the start of the file for a simple zip, and decompress the bytes with zlib (see zipfile.py). This is not a valid way to read a zip file, and while it might work for your specific scenario, it could also fail on a lot of valid zips. Reading the central directory file header is the only right way to read a zip.

Upvotes: 1
