Reputation: 1743
Is there a way to do streaming decompression of single-file zip archives?
I currently have arbitrarily large zipped archives (single file per archive) in s3. I would like to be able to process the files by iterating over them without having to actually download the files to disk or into memory.
A simple example:
import boto

def count_newlines(bucket_name, key_name):
    conn = boto.connect_s3()
    b = conn.get_bucket(bucket_name)
    # key is a .zip file
    key = b.get_key(key_name)

    count = 0
    for chunk in key:
        # How should decompress happen?
        count += decompress(chunk).count('\n')
    return count
This answer demonstrates a method of doing the same thing with gzip'd files. Unfortunately, I haven't been able to get the same technique to work using the zipfile module, as it seems to require random access to the entire file being unzipped.
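For reference, the gzip approach boils down to feeding each chunk through a zlib decompressor, roughly like this sketch:

import zlib

def count_newlines_gzip(key):
    # Sketch of the streaming-gzip technique; wbits = MAX_WBITS | 16
    # tells zlib to expect a gzip header and trailer
    decompressor = zlib.decompressobj(zlib.MAX_WBITS | 16)
    count = 0
    for chunk in key:
        count += decompressor.decompress(chunk).count(b'\n')
    count += decompressor.flush().count(b'\n')
    return count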
Upvotes: 5
Views: 5931
Reputation: 27012
While I suspect it's not possible for absolutely all zip files, I also suspect almost(?) all modern zip files are streaming-compatible, and it is possible to do streaming decompression, for example using https://github.com/uktrade/stream-unzip [full disclosure: originally written by me].
The example from its README shows how to do this for an arbitrary HTTP request using httpx:
from stream_unzip import stream_unzip
import httpx

def zipped_chunks():
    # Any iterable that yields a zip file
    with httpx.stream('GET', 'https://www.example.com/my.zip') as r:
        yield from r.iter_bytes(chunk_size=65536)

for file_name, file_size, unzipped_chunks in stream_unzip(zipped_chunks()):
    for chunk in unzipped_chunks:
        print(chunk)
but I think it could be adapted for boto3 to stream unzip/decompress from S3 (untested):
from stream_unzip import stream_unzip
import boto3

def zipped_chunks():
    yield from boto3.client('s3', region_name='us-east-1').get_object(
        Bucket='my-bucket-name',
        Key='the/key/of/the.zip'
    )['Body'].iter_chunks()

for file_name, file_size, unzipped_chunks in stream_unzip(zipped_chunks()):
    for chunk in unzipped_chunks:
        print(chunk)
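Adapting that to the question's newline count would then be something like the following (also untested; it just re-uses the pattern above):

from stream_unzip import stream_unzip
import boto3

def count_newlines(bucket_name, key_name):
    def zipped_chunks():
        # Assumes default boto3 credentials/region configuration
        yield from boto3.client('s3').get_object(
            Bucket=bucket_name,
            Key=key_name,
        )['Body'].iter_chunks()

    count = 0
    for file_name, file_size, unzipped_chunks in stream_unzip(zipped_chunks()):
        for chunk in unzipped_chunks:
            count += chunk.count(b'\n')
    return count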
Upvotes: 3
Reputation: 106
You can use https://pypi.python.org/pypi/tubing; it even has built-in S3 source support using boto3.
from tubing.ext import s3
from tubing import pipes, sinks

output = s3.S3Source(bucket, key) \
    | pipes.Gunzip() \
    | pipes.Split(on=b'\n') \
    | sinks.Objects()

print(len(output))
If you don't want to store the entire output in the returned sink, you can make your own sink that just counts. The implementation would look like this:
class CountWriter(object):
    def __init__(self):
        self.count = 0

    def write(self, chunk):
        self.count += len(chunk)

Counter = sinks.MakeSink(CountWriter)
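A rough sketch of how the counting sink might then be wired in, assuming a MakeSink-produced sink composes the same way as sinks.Objects() and that the pipeline result exposes the writer's attributes (untested assumptions about tubing's API):

counted = s3.S3Source(bucket, key) \
    | pipes.Gunzip() \
    | pipes.Split(on=b'\n') \
    | Counter()

# Hypothetical: assumes the pipeline result exposes the writer's count
print(counted.count)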
Upvotes: 1
Reputation: 107
You can do it in Python 3.4.3 using ZipFile as follows:
from zipfile import ZipFile

with ZipFile('spam.zip') as myzip:
    with myzip.open('eggs.txt') as myfile:
        print(myfile.read())
Upvotes: -3
Reputation: 112349
Yes, but you'll likely have to write your own code to do it if it has to be in Python. You can look at sunzip for an example, in C, of how to unzip a zip file from a stream. sunzip creates temporary files as it decompresses the zip entries, and then moves those files and sets their attributes appropriately upon reading the central directory at the end. Claims that you must be able to seek to the central directory in order to properly unzip a zip file are incorrect.
Upvotes: 2
Reputation: 116
The zip central directory, which zipfile relies on as its index, is at the end of the file, which is why it needs random access. See https://en.wikipedia.org/wiki/Zip_(file_format)#Structure.
You could parse the local file header, which should be at the start of the file for a simple zip, and decompress the bytes with zlib (see zipfile.py). This is not a valid way to read a zip file, and while it might work for your specific scenario, it could also fail on a lot of valid zips; reading the central directory file header is the only correct way to read a zip. A rough sketch of that header-parsing approach follows.
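A sketch of that idea, assuming a single deflate-compressed entry with no encryption and no trailing data descriptor (bit 3 of the flags unset), assumptions that many valid zips will violate:

import struct
import zlib

def stream_decompress_first_entry(chunks):
    # Collect the 30-byte fixed part of the local file header
    chunks = iter(chunks)
    buf = b''
    while len(buf) < 30:
        buf += next(chunks)
    (signature, _version, flags, method, _mtime, _mdate, _crc,
     compressed_size, _uncompressed_size,
     name_len, extra_len) = struct.unpack('<IHHHHHIIIHH', buf[:30])
    if signature != 0x04034b50:
        raise ValueError('not a local file header')
    if method != 8 or flags & 0x08:
        raise ValueError('this sketch only handles plain deflate entries')

    # Skip the file name and extra field to reach the compressed data
    data_start = 30 + name_len + extra_len
    while len(buf) < data_start:
        buf += next(chunks)

    decompressor = zlib.decompressobj(-zlib.MAX_WBITS)  # raw deflate
    remaining = compressed_size
    first = buf[data_start:data_start + remaining]
    remaining -= len(first)
    yield decompressor.decompress(first)
    while remaining > 0:
        chunk = next(chunks)[:remaining]
        remaining -= len(chunk)
        yield decompressor.decompress(chunk)
    yield decompressor.flush()

With the question's boto key, something like sum(part.count(b'\n') for part in stream_decompress_first_entry(key)) would then count the newlines.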
Upvotes: 1