Reputation: 163
I have multiple gz files with a total size of around 120GB. I want to decompress (gunzip) those files into the same directory and remove the existing gz files. Currently we are doing it manually, and decompressing them one at a time with gzip -d <filename> takes a long time.
Is there a way I can decompress these files in parallel, by creating a Python script or some other technique? These files are currently on a Linux machine.
Upvotes: 9
Views: 4679
Reputation: 104514
A large portion of the wall-clock time spent decompressing a file with gunzip (or gzip -d) goes to I/O operations (reading from and writing to disk). It can even exceed the time spent actually decompressing the data. You can take advantage of this by having multiple gzip jobs going in the background: while some jobs are blocked on I/O, another job can actually run without having to wait in a queue.
You can speed up decompression of the entire file set by having multiple gunzip processes running in the background, each serving a specific subset of the files.
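For instance, just starting two decompressions in the background and waiting for both already lets their I/O overlap (a minimal illustration; a.gz and b.gz are placeholder file names):
gzip -d a.gz &
gzip -d b.gz &
wait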
You can whip up something easy in BASH: split the file list into separate commands, use & to start each one as a background job, then wait for all the jobs to finish.
I would recommend having between 2 and 2*N jobs going at once, where N is the number of cores or logical processors on your computer. Experiment as appropriate to get the right number.
Here is a simple script that does this:
#!/bin/bash
argarray=( "$@" )
len=${#argarray[@]}

# declare 4 empty array sets
set1=()
set2=()
set3=()
set4=()

# enumerate over each argument passed to the script
# and round-robin add it to one of the above arrays
i=0
while [ $i -lt $len ]
do
    if [ $i -lt $len ]; then
        set1+=( "${argarray[$i]}" )
        ((i++))
    fi
    if [ $i -lt $len ]; then
        set2+=( "${argarray[$i]}" )
        ((i++))
    fi
    if [ $i -lt $len ]; then
        set3+=( "${argarray[$i]}" )
        ((i++))
    fi
    if [ $i -lt $len ]; then
        set4+=( "${argarray[$i]}" )
        ((i++))
    fi
done

# for each non-empty array, start a background job
# (quoting the arrays preserves file names with spaces,
#  and the emptiness check avoids running gzip -d with no arguments)
[ ${#set1[@]} -gt 0 ] && gzip -d "${set1[@]}" &
[ ${#set2[@]} -gt 0 ] && gzip -d "${set2[@]}" &
[ ${#set3[@]} -gt 0 ] && gzip -d "${set3[@]}" &
[ ${#set4[@]} -gt 0 ] && gzip -d "${set4[@]}" &

# wait for all jobs to finish
wait
In the above example, the file names passed on the command line are split round-robin across four sets, and one background gzip -d job is started per set. You can easily extend the script to use more jobs or more files per process.
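If you would rather not hard-code the number of sets, the same idea can be expressed more compactly with xargs, which splits the argument list and limits the number of parallel processes for you. A minimal sketch, assuming GNU xargs and coreutils' nproc are available (the 2*N job count follows the recommendation above):
#!/bin/bash
# run one gzip -d per file, at most 2*N at a time,
# where N is the number of logical processors reported by nproc
jobs=$(( 2 * $(nproc) ))
printf '%s\0' "$@" | xargs -0 -n 1 -P "$jobs" gzip -d
Called as, say, ./unzip-parallel.sh *.gz (the script name is just a placeholder), it decompresses the files in place and, like any gzip -d invocation, removes each original .gz once it has been decompressed.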
Upvotes: 2
Reputation: 17751
You can do this very easily with multiprocessing Pools:
import gzip
import multiprocessing
import shutil

filenames = [
    'a.gz',
    'b.gz',
    'c.gz',
    ...
]

def uncompress(path):
    # write the decompressed data next to the archive,
    # dropping only the trailing '.gz' from the name
    with gzip.open(path, 'rb') as src, open(path[:-len('.gz')], 'wb') as dest:
        shutil.copyfileobj(src, dest)

if __name__ == '__main__':
    with multiprocessing.Pool() as pool:
        for _ in pool.imap_unordered(uncompress, filenames, chunksize=1):
            pass
This code will spawn a few processes, and each process will extract one file at a time. Here I've chosen chunksize=1 to avoid stalling processes if some files are bigger than average.
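Note that, unlike gzip -d, the snippet above leaves the original .gz files in place, while the question asks to remove them. A minimal sketch of one way to extend it, assuming the archives all sit in one directory (the /data/incoming path and the uncompress_and_remove name are placeholders):
import glob
import gzip
import multiprocessing
import os
import shutil

def uncompress_and_remove(path):
    # decompress into the same directory, then delete the original archive
    with gzip.open(path, 'rb') as src, open(path[:-len('.gz')], 'wb') as dest:
        shutil.copyfileobj(src, dest)
    os.remove(path)

if __name__ == '__main__':
    filenames = glob.glob('/data/incoming/*.gz')  # placeholder directory
    with multiprocessing.Pool() as pool:
        for _ in pool.imap_unordered(uncompress_and_remove, filenames, chunksize=1):
            pass
Since os.remove only runs after the with block has closed the output file without raising, a failed decompression leaves the original .gz untouched.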
Upvotes: 11