Reputation: 2296
I have 2 TB of data, and I have to unzip the files to do some analysis. However, due to limited hard disk space, I cannot unzip all of the files at once. What I thought is to unzip the first two thousand of them, do my analysis, and then repeat for the next 2000. How could I do it?
import os, glob
import zipfile
root = 'C:\\Users\\X\\*'
directory = 'C:\\Users\\X'
extension = ".zip"
to_save = 'C:\\Users\\X\\to_save'
#x = os.listdir(path)[:2000]
for folder in glob.glob(root):
    if folder.endswith(extension):  # check for ".zip" extension
        try:
            print(folder)
            os.chdir(to_save)
            zipfile.ZipFile(os.path.join(directory, folder)).extractall(os.path.join(directory, os.path.splitext(folder)[0]))
        except:
            pass
Upvotes: 1
Views: 103
Reputation: 3848
What about this?
import os
import glob
import zipfile
root = 'C:\\Users\\X\\*'
directory = 'C:\\Users\\X'
extension = ".zip"
to_save = 'C:\\Users\\X\\to_save'
# list comp of all '.zip' folders
folders = [folder for folder in glob.glob(root) if folder.endswith(extension)]
# only executes while there are folders remaining to be processed
while folders:
    # only grabs the next 2000 folders if there are at least that many
    if len(folders) >= 2000:
        temp = folders[:2000]
    # otherwise gets all the remaining (e.g. if only 1152 were left)
    else:
        temp = folders[:]
    # list comp that rebuilds 'folders' with elements not pulled into 'temp'
    folders = [folder for folder in folders if folder not in temp]
    # this was all your code, I just swapped 'x' in place of 'folder'
    for x in temp:
        try:
            print(x)
            os.chdir(to_save)
            zipfile.ZipFile(os.path.join(directory, x)).extractall(os.path.join(directory, os.path.splitext(x)[0]))
        except:
            pass
This makes a temporary list of the .zip files and then removes those elements from the original list. The only drawback is that folders gets modified, so it will eventually be empty if you ever need to use it elsewhere.
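If you do need folders left untouched, a slightly different way to chunk it is to walk the list in fixed-size slices. This is just a sketch reusing the same example paths: chunk_size and the slicing loop are my additions, and the extra os.path.join calls are dropped because glob.glob already returns full paths here.

import os
import glob
import zipfile

root = 'C:\\Users\\X\\*'
directory = 'C:\\Users\\X'
extension = ".zip"
to_save = 'C:\\Users\\X\\to_save'

folders = [folder for folder in glob.glob(root) if folder.endswith(extension)]
chunk_size = 2000  # archives to extract per pass (name introduced for this sketch)

# step through the list 2000 entries at a time; a slice past the end just
# returns the shorter remainder, so no length check is needed and 'folders'
# is never modified
for start in range(0, len(folders), chunk_size):
    for x in folders[start:start + chunk_size]:
        try:
            print(x)
            # extract next to the archive, into a folder named after it
            zipfile.ZipFile(x).extractall(os.path.splitext(x)[0])
        except Exception:
            pass
    # run the analysis for this batch here, then delete the extracted
    # files to free disk space before the next pass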
Upvotes: 3