tomriddle_1234

Reputation: 3213

How to use Python multiprocessing to optimize a file-zipping function?

I have a working function that compresses multiple files into one zip file:

import os
import zipfile

targetzipfile = os.path.normpath(targetfolder) + '.zip'
zipf = zipfile.ZipFile(targetzipfile, 'w', zipfile.ZIP_DEFLATED, allowZip64=True)

for root, dirs, files in os.walk(targetfolder):
    for f in files:
        # zipfile.write(absfilename, archivename): the archive name is the
        # path shown inside the zip file, so use a path relative to the
        # parent of targetfolder.
        filepath = os.path.join(root, f)
        print("compressing: %s" % filepath)
        zipf.write(filepath,
                   os.path.relpath(filepath,
                                   os.path.dirname(os.path.normpath(targetfolder))))
zipf.close()

But it is very slow to run because I have lots of files, so I'm looking for a way to optimize this loop with multiprocessing in Python, similar to OpenMP.

Thanks for any advice.

Upvotes: 2

Views: 8229

Answers (2)

Leonardo.Z

Reputation: 9791

I doubt that multiprocessing would help here.

The zipfile module in the Python stdlib is not thread-safe!

So, how can we optimize your code?

ALWAYS profile before and while performing optimizations.

Since I don't know your file structure, I'll take the Python source tree as an example.

$ time python singleprocess.py
python singleprocess.py  2.31s user 0.22s system 100% cpu 2.525 total
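Before reaching for parallelism, it helps to see where that time actually goes. A minimal sketch using the stdlib `cProfile` module (the `profile_call` name is made up here; you would pass it whatever function does the zipping):

```python
import cProfile
import pstats

def profile_call(func, *args):
    # Run func under cProfile and print the 10 most expensive entries,
    # sorted by cumulative time, so you can see whether the time goes
    # into zlib compression or into file I/O.
    prof = cProfile.Profile()
    prof.runcall(func, *args)
    stats = pstats.Stats(prof).sort_stats("cumulative")
    stats.print_stats(10)
    return stats
```

If most of the cumulative time lands in zlib, a lower compression level will help more than adding processes.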

Then let's try the zip command shipped with Ubuntu (Info-ZIP).

You can specify the compression level for the zip command: -1 means the fastest compression (least compression) and -9 the slowest (best compression). The default level is -6.

$ time zip python.zip Python-2.7.6 -r -q
zip python.zip Python-2.7.6 -r -q  2.02s user 0.11s system 99% cpu 2.130 total

$ time zip python.zip Python-2.7.6 -r -q  -1
zip python.zip Python-2.7.6 -r -q -1  1.00s user 0.11s system 99% cpu 1.114 total

$ time zip python.zip Python-2.7.6 -r -q  -9
zip python.zip Python-2.7.6 -r -q -9  4.92s user 0.11s system 99% cpu 5.034 total
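For comparison, the same level control exists in-process: since Python 3.7, `zipfile.ZipFile` accepts a `compresslevel` argument mirroring Info-ZIP's -1..-9 switches. A sketch (`make_zip` and `payload.txt` are illustrative names):

```python
import io
import zipfile

def make_zip(data, level):
    # Compress one member in memory at the given deflate level
    # (1 = fastest, 9 = smallest). Requires Python 3.7+ for the
    # compresslevel argument.
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED,
                         compresslevel=level) as zf:
        zf.writestr("payload.txt", data)
    return buf.getvalue()
```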

You see, the performance of Python's zlib module is very competitive. But some professional zip tools can give you more control over the compression strategy.

You can call those external commands with the subprocess module in Python.
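A minimal sketch of shelling out to Info-ZIP (the function names are made up, and it assumes the zip binary is on PATH):

```python
import subprocess

def build_zip_cmd(archive, folder, level=6):
    # -r: recurse into subdirectories, -q: quiet, -<level>: 1 (fastest)
    # through 9 (smallest) -- the same Info-ZIP flags used above.
    return ["zip", "-r", "-q", "-%d" % level, archive, folder]

def zip_folder(archive, folder, level=6):
    # check_call raises CalledProcessError on a non-zero exit status,
    # and FileNotFoundError if the zip binary is not installed.
    subprocess.check_call(build_zip_cmd(archive, folder, level))
```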

Besides, when you use the Python code above to zip a directory, you'll lose the metadata (permission bits, last access time, last modification time, ...) of the directory and its subdirectories.
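If that metadata matters, the stdlib tarfile module does record it, for directories as well as files. A sketch, under the assumption that a tar archive is an acceptable substitute for a zip:

```python
import os
import tarfile

def tar_folder(targetfolder, archive):
    # tarfile preserves permission bits and modification times for every
    # entry it adds, including the directories themselves.
    with tarfile.open(archive, "w:gz") as tf:
        tf.add(targetfolder, arcname=os.path.basename(targetfolder))
```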

Upvotes: 2

onsy

Reputation: 750

Since a ZipFile handle cannot be shared between worker processes, each worker below appends to its own part archive:

import multiprocessing
import os
import zipfile

targetfolder = "targetFolder"

# Collect (filepath, archivename) pairs in the parent process.
data = []
for root, dirs, files in os.walk(targetfolder):
    for f in files:
        filepath = os.path.join(root, f)
        data.append((filepath,
                     os.path.relpath(filepath,
                                     os.path.dirname(os.path.normpath(targetfolder)))))

def mp_worker(args):
    filepath, archivename = args
    print("compressing: %s" % filepath)
    # Each worker appends to a part archive named after its own pid,
    # so the result is several part-<pid>.zip files rather than one zip.
    with zipfile.ZipFile("part-%d.zip" % os.getpid(), 'a',
                         zipfile.ZIP_DEFLATED) as zipf:
        zipf.write(filepath, archivename)
    print(" Process %s\tDONE" % filepath)

def mp_handler():
    p = multiprocessing.Pool(2)
    p.map(mp_worker, data)

if __name__ == '__main__':
    mp_handler()

You can refer to Python Module of the Week.

Upvotes: -1
