djmac

Reputation: 905

Unzip nested zip files in python

I am looking for a way to unzip nested zip files in python. For example, consider the following structure (hypothetical names for ease):

...etc. I am trying to access text files that are within the second zip. I certainly don't want to extract everything, as the sheer numbers would crash the computer (there are several hundred zips in the first layer, and almost 10,000 in the second layer per zip).

I have been playing around with the 'zipfile' module. I am able to open the first level of zip files, e.g.:

zipfile_obj = zipfile.ZipFile("/Folder/ZipfileA.zip")
next_layer_zip = zipfile_obj.open("ZipfileA1.zip")

However, this returns a "ZipExtFile" instance (not a file or zipfile instance), and I can't then go on and open this particular data type. That is, I can't do this:

data = next_layer_zip.open("data.txt")

I can however "read" this zip file with:

next_layer_zip.read()

But this is entirely useless! (i.e. I can only read compressed data/gobbledygook).

Does anyone have any ideas on how I might go about this (without using ZipFile.extract)??

I came across this, http://pypi.python.org/pypi/zip_open/, which looks to do exactly what I want, but it doesn't seem to work for me. (I keep getting "[Errno 2] No such file or directory:" for the files I am trying to process, using that module).

Any ideas would be much appreciated!! Thanks in advance

Upvotes: 18

Views: 23989

Answers (7)

Julian

Reputation: 152

My approach to such a problem is this. It loads every spreadsheet found in the nested archives into a dict keyed by file name:

import io
import os
import zipfile
import pandas as pd

path = r'G:\Important\Data\EKATTE'

# Collect the top-level archives in the folder
archives = [ar for ar in os.listdir(path) if ar.endswith('.zip')]

# Map each spreadsheet's base name to its loaded data
tables = {}
for a in archives:
    with zipfile.ZipFile(os.path.join(path, a)) as archive:
        for i in archive.namelist():
            if i.endswith('.zip'):
                # Open the nested archive in place (needs Python 3.7+,
                # where the object returned by open() is seekable)
                with zipfile.ZipFile(archive.open(i)) as sub_arch:
                    for s in sub_arch.namelist():
                        data = io.BytesIO(sub_arch.read(s))
                        # squeeze('columns') stands in for read_excel's
                        # squeeze=True, which pandas 2.0 removed
                        tables[s.split('.')[0]] = pd.read_excel(data).squeeze('columns')

The archive can be found on Bulgaria's National Statistics Institute page (direct link): https://www.nsi.bg/sites/default/files/files/EKATTE/Ekatte.zip

Upvotes: 0

Anqi777

Reputation: 71

This works for me. Just place this script in the same directory as the nested zip. It also counts the total number of files within the nested zip. Note that it extracts everything to disk and deletes each zip file once it has been extracted.

import os
from zipfile import ZipFile


def unzip(path, total_count):
    """Extract every zip under path (recursively) and count the other files."""
    for root, dirs, files in os.walk(path):
        for file in files:
            file_name = os.path.join(root, file)
            if not file_name.endswith('.zip'):
                total_count += 1
            else:
                # Extract into a folder named after the zip, minus ".zip"
                currentdir = file_name[:-4]
                if not os.path.exists(currentdir):
                    os.makedirs(currentdir)
                with ZipFile(file_name) as zipObj:
                    zipObj.extractall(currentdir)
                # Delete the zip, then process whatever it contained
                os.remove(file_name)
                total_count = unzip(currentdir, total_count)
    return total_count


total_count = unzip('.', 0)
print(total_count)
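
If only the count is needed, the same recursion can run without extracting anything to disk. A minimal sketch, assuming each nested archive is small enough to hold in memory one at a time; the function name count_members and the sample file names are made up for illustration:

```python
import io
import zipfile


def count_members(zip_source):
    """Count non-zip files in an archive, descending into nested zips.

    zip_source is a path or a binary file-like object. Nested archives
    are read into memory one at a time; nothing is written to disk.
    """
    total = 0
    with zipfile.ZipFile(zip_source) as z:
        for info in z.infolist():
            if info.is_dir():
                continue
            if info.filename.lower().endswith('.zip'):
                # Recurse into the nested archive via an in-memory buffer
                total += count_members(io.BytesIO(z.read(info)))
            else:
                total += 1
    return total


# Hypothetical sample: outer archive holds a.txt plus inner.zip (b.txt, c.txt)
inner = io.BytesIO()
with zipfile.ZipFile(inner, 'w') as z:
    z.writestr('b.txt', 'b')
    z.writestr('c.txt', 'c')
outer = io.BytesIO()
with zipfile.ZipFile(outer, 'w') as z:
    z.writestr('a.txt', 'a')
    z.writestr('inner.zip', inner.getvalue())

print(count_members(outer))
```

Unlike the script above, this leaves the original zip files untouched.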

Upvotes: 4

yutaka Kajiwara

Reputation: 111

I use python 3.7.3

import zipfile
import io
with zipfile.ZipFile('all.zip') as z:
    with z.open('nested.zip') as z2:
        z2_filedata =  io.BytesIO(z2.read())
        with zipfile.ZipFile(z2_filedata) as nested_zip:
            print( nested_zip.open('readme.md').read())
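
To try the same pattern without existing archives on disk, here is a minimal sketch that builds a two-level archive in memory and reads one member back. The helper names make_zip and read_nested, and the sample contents, are made up for illustration:

```python
import io
import zipfile


def make_zip(members):
    """Return the bytes of a zip archive built from a {name: bytes} dict."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, 'w') as z:
        for name, data in members.items():
            z.writestr(name, data)
    return buf.getvalue()


def read_nested(outer, inner_name, member_name):
    """Read one member of a zip stored inside another zip.

    outer is a path or a binary file-like object. Only the inner
    archive is held in memory; nothing is written to disk.
    """
    with zipfile.ZipFile(outer) as z:
        inner_bytes = io.BytesIO(z.read(inner_name))
    with zipfile.ZipFile(inner_bytes) as nested_zip:
        return nested_zip.read(member_name)


# Hypothetical sample data: the outer zip contains nested.zip,
# which contains readme.md
nested = make_zip({'readme.md': b'hello from the inner zip\n'})
outer = io.BytesIO(make_zip({'nested.zip': nested}))

content = read_nested(outer, 'nested.zip', 'readme.md')
print(content.decode('utf-8'))
```

The result is bytes, so decode it with the file's encoding before treating it as text.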

Upvotes: 11

ronnydw

Reputation: 953

For those looking for a function that extracts a nested zip file (any level of nesting) and cleans up the original zip files:

import zipfile, re, os

def extract_nested_zip(zippedFile, toFolder):
    """ Unzip a zip file and its contents, including nested zip files
        Delete the zip file(s) after extraction
    """
    with zipfile.ZipFile(zippedFile, 'r') as zfile:
        zfile.extractall(path=toFolder)
    os.remove(zippedFile)
    for root, dirs, files in os.walk(toFolder):
        for filename in files:
            fileSpec = os.path.join(root, filename)
            # A recursive call may already have extracted and removed this zip
            if re.search(r'\.zip$', filename) and os.path.exists(fileSpec):
                extract_nested_zip(fileSpec, root)

Upvotes: 8

Matt Faus

Reputation: 6691

Here's a function I came up with.

import io
import os
import zipfile


def extract_nested_zipfile(path, parent_zip=None):
    """Returns a ZipFile specified by path, even if the path contains
    intermediary ZipFiles.  For example, /root/gparent.zip/parent.zip/child.zip
    will return a ZipFile that represents child.zip
    """

    def extract_inner_zipfile(parent_zip, child_zip_path):
        """Returns a ZipFile specified by child_zip_path that exists inside
        parent_zip.
        """
        # Copy the inner zip into memory so ZipFile gets a seekable file object
        memory_zip = io.BytesIO(parent_zip.read(child_zip_path))
        return zipfile.ZipFile(memory_zip)

    if ('.zip' + os.sep) in path:
        (parent_zip_path, child_zip_path) = os.path.relpath(path).split(
            '.zip' + os.sep, 1)
        parent_zip_path += '.zip'

        if not parent_zip:
            # This is the top-level, so read from disk
            parent_zip = zipfile.ZipFile(parent_zip_path)
        else:
            # We're already in a zip, so pull it out and recurse
            parent_zip = extract_inner_zipfile(parent_zip, parent_zip_path)

        return extract_nested_zipfile(child_zip_path, parent_zip)
    else:
        if parent_zip:
            return extract_inner_zipfile(parent_zip, path)
        else:
            # If there is no nesting, it's easy!
            return zipfile.ZipFile(path)

Here's how I tested it:

echo hello world > hi.txt
zip wrap1.zip hi.txt
zip wrap2.zip wrap1.zip
zip wrap3.zip wrap2.zip

print(extract_nested_zipfile('/Users/mattfaus/dev/dev-git/wrap1.zip').open('hi.txt').read())
print(extract_nested_zipfile('/Users/mattfaus/dev/dev-git/wrap2.zip/wrap1.zip').open('hi.txt').read())
print(extract_nested_zipfile('/Users/mattfaus/dev/dev-git/wrap3.zip/wrap2.zip/wrap1.zip').open('hi.txt').read())

Upvotes: 6

Daniel W. Steinbrook

Reputation: 196

ZipFile needs a file-like object, so you can use an in-memory buffer (io.BytesIO in Python 3, StringIO in Python 2) to turn the data you read from the nested zip into such an object. The caveat is that you'll be loading the full (still compressed) inner zip into memory.

import io
import zipfile

with zipfile.ZipFile('foo.zip') as z:
    with z.open('nested.zip') as z2:
        z2_filedata = io.BytesIO(z2.read())
        with zipfile.ZipFile(z2_filedata) as nested_zip:
            print(nested_zip.open('data.txt').read())
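
Since the whole inner zip ends up in memory, it may be worth checking its size before reading it. A minimal sketch, assuming a made-up helper name open_nested and an arbitrary default limit of 100 MB; the size comes from the outer archive's own metadata:

```python
import io
import zipfile


def open_nested(outer_zip, inner_name, max_bytes=100 * 1024 * 1024):
    """Open a nested zip in memory, refusing members above max_bytes."""
    info = outer_zip.getinfo(inner_name)
    # file_size is the member's uncompressed size, i.e. the number of
    # bytes that read() will return for the inner .zip file
    if info.file_size > max_bytes:
        raise ValueError('%s is %d bytes, over the %d byte limit'
                         % (inner_name, info.file_size, max_bytes))
    return zipfile.ZipFile(io.BytesIO(outer_zip.read(info)))


# Hypothetical sample data: outer archive -> nested.zip -> data.txt
inner = io.BytesIO()
with zipfile.ZipFile(inner, 'w') as z:
    z.writestr('data.txt', 'some text')
outer = io.BytesIO()
with zipfile.ZipFile(outer, 'w') as z:
    z.writestr('nested.zip', inner.getvalue())

with zipfile.ZipFile(outer) as z:
    with open_nested(z, 'nested.zip') as nested_zip:
        text = nested_zip.read('data.txt').decode('utf-8')
print(text)
```

Inner zips over the limit could instead be extracted to a temporary file and opened from there.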

Upvotes: 10

Ignacio Vazquez-Abrams

Reputation: 799320

Unfortunately decompressing zip files requires random access to the archive, and the ZipFile methods (not to mention the DEFLATE algorithm itself) only provide streams. It is therefore impossible to decompress nested zip files without extracting them.
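
For comparison, on Python 3.7 and later the object returned by ZipFile.open() supports seek(), so ZipFile will accept it directly, which is what one of the other answers here relies on. A minimal sketch, assuming Python 3.7+ and made-up file names; seeking backwards within a compressed member may have to restart decompression from the beginning, so this can be slow when the inner zip was itself compressed:

```python
import io
import zipfile

# Hypothetical sample data: outer archive -> inner.zip -> data.txt
inner = io.BytesIO()
with zipfile.ZipFile(inner, 'w') as z:
    z.writestr('data.txt', 'first line\nsecond line\n')
outer = io.BytesIO()
with zipfile.ZipFile(outer, 'w') as z:
    z.writestr('inner.zip', inner.getvalue())

lines = []
with zipfile.ZipFile(outer) as z:
    # The inner archive is opened through the outer one, not copied first
    with z.open('inner.zip') as inner_file:
        with zipfile.ZipFile(inner_file) as nested_zip:
            with nested_zip.open('data.txt') as member:
                # Decode and iterate line by line instead of calling read()
                for line in io.TextIOWrapper(member, encoding='utf-8'):
                    lines.append(line.rstrip('\n'))
print(lines)
```

Nothing is written to disk here, although the inner archive's bytes are still read out of the outer one as they are needed.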

Upvotes: 7
