Parallelize for-loop in python

Question

I have a simple set of code that runs Clustal Omega (a protein multiple sequence alignment program) from Python:

from Bio.Align.Applications import ClustalOmegaCommandline

segments = list(range(1, 9))  # list() so that .reverse() also works on Python 3
segments.reverse()

for segment in segments:
    in_file = '1.0 - Split FASTA Files/Segment %d.fasta' % segment
    out_file = '1.1 - Aligned FASTA Files/Segment %d Aligned.fasta' % segment
    distmat = '1.1 - Distmats/Segment %d Distmat.fasta' % segment

    cline = ClustalOmegaCommandline(infile=in_file,
                                    outfile=out_file,
                                    distmat_out=distmat,
                                    distmat_full=True,
                                    verbose=True,
                                    force=True)
    print(cline)
    cline()

I've informally timed my multiple sequence alignments (MSAs). On average, each one takes 4 hours, so running all 8 one after another took me 32 hours in total. That was my original reason for running it as a for loop: I could let it run and not worry about it.

However, I did yet another informal test - I took the output from the printed cline, and copied-and-pasted it into 8 separate terminal windows spread across two computers, and ran the MSAs that way. On average, each one took about 8 hours or so... but because they were all running at the same time, it took me only 8 hours to get the results.

In some ways, I've discovered parallel processing! :D

But I'm now faced with the dilemma of how to get it running in Python. I've tried looking at the following SO posts, but I still cannot seem to wrap my head around how the multiprocessing module works.

List of posts:

Would anybody be kind enough to share how they would parallelize this loop? Many loops I do look similar to this loop, in which I perform some action on a file and write to another file, without ever needing to aggregate the results in memory. The specific difference I am facing is the need to do file I/O, rather than aggregate results from parallel runs of the loop.

logc · Accepted Answer

Possibly the Joblib library is what you are looking for.

Let me give you an example of its use:

import time
from joblib import Parallel, delayed


def long_function():
    time.sleep(1)


REPETITIONS = 4
Parallel(n_jobs=REPETITIONS)(
    delayed(long_function)() for _ in range(REPETITIONS))

This code runs in about 1 second instead of 4, because the four calls to long_function sleep at the same time rather than one after another.

Adapting your code looks like this (sorry, I can't test if this is correct):

from joblib import Parallel, delayed

from Bio.Align.Applications import ClustalOmegaCommandline


def run(segment):
    in_file = '1.0 - Split FASTA Files/Segment %d.fasta' % segment
    out_file = '1.1 - Aligned FASTA Files/Segment %d Aligned.fasta' % segment
    distmat = '1.1 - Distmats/Segment %d Distmat.fasta' % segment
    cline = ClustalOmegaCommandline(infile=in_file,
                                    outfile=out_file,
                                    distmat_out=distmat,
                                    distmat_full=True,
                                    verbose=True,
                                    force=True)
    print(cline)
    cline()


if __name__ == "__main__":
    segments = list(range(1, 9))
    segments.reverse()

    # One job per segment; note the closing parenthesis after n_jobs=...
    Parallel(n_jobs=len(segments))(
        delayed(run)(segment) for segment in segments)
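
If you would rather not install an extra library, the same pattern can be sketched with the standard library's concurrent.futures. This is only a sketch under some assumptions: run_segment and run_all are hypothetical names, and the time.sleep call is a stand-in for calling cline(). Threads are used here because the heavy work happens in the external Clustal Omega process, so the Python side mostly waits. I have not tested it against real alignments.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_segment(segment):
    # Hypothetical stand-in for the real work: in the actual script this is
    # where the ClustalOmegaCommandline object would be built and called.
    time.sleep(0.2)
    return segment


def run_all(segments, max_workers=None):
    # One worker per segment unless the caller asks for fewer.
    workers = max_workers or len(segments)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the input segments.
        return list(pool.map(run_segment, segments))


if __name__ == "__main__":
    segments = list(range(8, 0, -1))
    start = time.time()
    finished = run_all(segments)
    print("finished segments %s in %.1f s" % (finished, time.time() - start))
```

Passing a smaller max_workers (for example, the number of cores on the machine) limits how many alignments run at once, which may matter given that each alignment slowed down when eight ran side by side.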
