Reputation: 163
I have multiple gz files with a total size of around 120GB. I want to decompress (gunzip) those files into the same directory and remove the existing gz files. Currently we are doing it manually, and decompressing them one at a time with gzip -d <filename> takes a long time.
Is there a way I can decompress these files in parallel, by creating a Python script or some other technique? These files are currently on a Linux machine.
Upvotes: 9
Views: 4679
Reputation: 104514
A large portion of the wall-clock time spent decompressing a file with gunzip (or gzip -d) goes to I/O operations (reading from and writing to disk). It can even exceed the time spent actually decompressing the data. You can take advantage of this by having multiple gzip jobs going in the background: while some jobs are blocked on I/O, another job can actually run without having to wait in a queue.
You can speed up decompression of the entire file set by having multiple gunzip processes running in the background, each serving a specific subset of the files.
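For instance, just starting two decompressions in the background and waiting for both already lets their I/O overlap (a minimal illustration; a.gz and b.gz are placeholder file names):
gzip -d a.gz &
gzip -d b.gz &
wait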
You can whip up something easy in BASH: split the file list into separate commands, use & to start each one as a background job, then wait for all the jobs to finish.
I would recommend having between 2 and 2*N jobs going at once, where N is the number of cores or logical processors on your computer. Experiment as appropriate to get the right number.
Here is a simple script that does this:
#!/bin/bash
argarray=( "$@" )
len=${#argarray[@]}

# declare 4 empty array sets
set1=()
set2=()
set3=()
set4=()

# enumerate over each argument passed to the script
# and round-robin add it to one of the above arrays
i=0
while [ $i -lt $len ]
do
    if [ $i -lt $len ]; then
        set1+=( "${argarray[$i]}" )
        ((i++))
    fi
    if [ $i -lt $len ]; then
        set2+=( "${argarray[$i]}" )
        ((i++))
    fi
    if [ $i -lt $len ]; then
        set3+=( "${argarray[$i]}" )
        ((i++))
    fi
    if [ $i -lt $len ]; then
        set4+=( "${argarray[$i]}" )
        ((i++))
    fi
done

# for each non-empty array, start a background job
# (quoting the arrays preserves file names with spaces,
#  and the emptiness check avoids running gzip -d with no arguments)
[ ${#set1[@]} -gt 0 ] && gzip -d "${set1[@]}" &
[ ${#set2[@]} -gt 0 ] && gzip -d "${set2[@]}" &
[ ${#set3[@]} -gt 0 ] && gzip -d "${set3[@]}" &
[ ${#set4[@]} -gt 0 ] && gzip -d "${set4[@]}" &

# wait for all jobs to finish
wait
In the above example, the file names passed on the command line are split round-robin across four sets, and one background gzip -d job is started per set. You can easily extend the script to use more jobs or more files per process.
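If you would rather not hard-code the number of sets, the same idea can be expressed more compactly with xargs, which splits the argument list and limits the number of parallel processes for you. A minimal sketch, assuming GNU xargs and coreutils' nproc are available (the 2*N job count follows the recommendation above):
#!/bin/bash
# run one gzip -d per file, at most 2*N at a time,
# where N is the number of logical processors reported by nproc
jobs=$(( 2 * $(nproc) ))
printf '%s\0' "$@" | xargs -0 -n 1 -P "$jobs" gzip -d
Called as, say, ./unzip-parallel.sh *.gz (the script name is just a placeholder), it decompresses the files in place and, like any gzip -d invocation, removes each original .gz once it has been decompressed.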
Upvotes: 2
Reputation: 17751
You can do this very easily with multiprocessing Pools:
import gzip
import multiprocessing
import shutil

filenames = [
    'a.gz',
    'b.gz',
    'c.gz',
    ...
]

def uncompress(path):
    # write the decompressed data next to the archive,
    # dropping only the trailing '.gz' from the name
    with gzip.open(path, 'rb') as src, open(path[:-len('.gz')], 'wb') as dest:
        shutil.copyfileobj(src, dest)

if __name__ == '__main__':
    with multiprocessing.Pool() as pool:
        for _ in pool.imap_unordered(uncompress, filenames, chunksize=1):
            pass
This code will spawn a few processes, and each process will extract one file at a time. Here I've chosen chunksize=1 to avoid stalling processes if some files are bigger than average.
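Note that, unlike gzip -d, the snippet above leaves the original .gz files in place, while the question asks to remove them. A minimal sketch of one way to extend it, assuming the archives all sit in one directory (the /data/incoming path and the uncompress_and_remove name are placeholders):
import glob
import gzip
import multiprocessing
import os
import shutil

def uncompress_and_remove(path):
    # decompress into the same directory, then delete the original archive
    with gzip.open(path, 'rb') as src, open(path[:-len('.gz')], 'wb') as dest:
        shutil.copyfileobj(src, dest)
    os.remove(path)

if __name__ == '__main__':
    filenames = glob.glob('/data/incoming/*.gz')  # placeholder directory
    with multiprocessing.Pool() as pool:
        for _ in pool.imap_unordered(uncompress_and_remove, filenames, chunksize=1):
            pass
Since os.remove only runs after the with block has closed the output file without raising, a failed decompression leaves the original .gz untouched.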
Upvotes: 11