edyvedy13

Reputation: 2296

Unzipping a number of files iteratively using Python

I have 2 TB of data, and I have to unzip the files to do some analysis. However, because of limited hard disk space, I cannot unzip all of the files at once. My idea is to unzip the first 2000 files, do my analysis, and then repeat for the next 2000. How can I do this?

import os, glob
import zipfile


root = 'C:\\Users\\X\\*'
directory = 'C:\\Users\\X'
extension = ".zip"
to_save = 'C:\\Users\\X\\to_save'

#x = os.listdir(path)[:2000]
for folder in glob.glob(root):
    if folder.endswith(extension): # check for ".zip" extension
        try:
            print(folder)
            os.chdir(to_save)
            zipfile.ZipFile(os.path.join(directory, folder)).extractall(os.path.join(directory, os.path.splitext(folder)[0]))

        except:
            pass

Upvotes: 1

Views: 103

Answers (1)

pstatix

Reputation: 3848

What about this?

import os
import glob
import zipfile

root = 'C:\\Users\\X\\*'
directory = 'C:\\Users\\X'
extension = ".zip"
to_save = 'C:\\Users\\X\\to_save'

# list comp of all '.zip' folders
folders = [folder for folder in glob.glob(root) if folder.endswith(extension)]

# only executes while there are folders remaining to be processed
while folders:
    # grabs up to the next 2000 folders; slicing simply returns whatever
    # is left when fewer than 2000 remain (i.e. 1152 were left)
    temp = folders[:2000]

    # drops the elements that were just pulled into 'temp'
    folders = folders[2000:]

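    # your extraction code, with 'x' in place of 'folder'; failures are
    # printed instead of silently ignored
    for x in temp:
        try:
            print(x)
            # glob already returns full paths, so 'x' can be opened directly
            with zipfile.ZipFile(x) as zf:
                zf.extractall(os.path.splitext(x)[0])
        except (zipfile.BadZipFile, OSError) as err:
            print('Failed to extract', x, err)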

This builds a temporary list of the .zip files and then removes those elements from the original list. The only drawback is that folders gets modified, so it will eventually be empty; keep a copy if you need the full list elsewhere.
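Since the goal is to stay within limited disk space, each batch probably also needs to be analyzed and deleted before the next one is extracted. Below is a minimal sketch of that cycle, not a definitive implementation. The names batches, process_in_batches and the analyze callback are hypothetical (they are not from the question), and it assumes the working directory has room for one extracted batch at a time.

```python
import shutil
import zipfile
from pathlib import Path


def batches(items, size):
    """Yield consecutive slices of `items`, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def process_in_batches(zip_dir, work_dir, analyze, batch_size=2000):
    """Extract archives one batch at a time, analyze them, then clean up.

    `analyze` is a hypothetical callback that receives the list of
    directories extracted for the current batch.
    """
    archives = sorted(Path(zip_dir).glob("*.zip"))
    for batch in batches(archives, batch_size):
        extracted = []
        for archive in batch:
            target = Path(work_dir) / archive.stem
            try:
                with zipfile.ZipFile(archive) as zf:
                    zf.extractall(target)
            except (zipfile.BadZipFile, OSError) as err:
                # Report the failure, drop any partial output, and move on.
                print(f"Skipping {archive}: {err}")
                shutil.rmtree(target, ignore_errors=True)
                continue
            extracted.append(target)

        analyze(extracted)

        # Free the disk space before the next batch is extracted.
        for target in extracted:
            shutil.rmtree(target, ignore_errors=True)
```

It could be called as process_in_batches('C:\\Users\\X', 'C:\\Users\\X\\to_save', my_analysis), where my_analysis stands for whatever function performs the analysis on a list of extracted directories.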

Upvotes: 3
