Reputation: 14684
I have a simple set of code that runs Clustal Omega (a protein multiple sequence alignment program) from Python:
from Bio.Align.Applications import ClustalOmegaCommandline

segments = range(1, 9)
segments.reverse()

for segment in segments:
    in_file = '1.0 - Split FASTA Files/Segment %d.fasta' % segment
    out_file = '1.1 - Aligned FASTA Files/Segment %d Aligned.fasta' % segment
    distmat = '1.1 - Distmats/Segment %d Distmat.fasta' % segment
    cline = ClustalOmegaCommandline(infile=in_file,
                                    outfile=out_file,
                                    distmat_out=distmat,
                                    distmat_full=True,
                                    verbose=True,
                                    force=True)
    print cline
    cline()
I've done some informal timing tests of how long my multiple sequence alignments (MSAs) take. On average, each one takes 4 hours, so running all 8 one after another took me 32 hours in total. That was my original reason for running it as a for loop: I could let it run and not worry about it.
However, I did yet another informal test - I took the output from the printed cline, and copied-and-pasted it into 8 separate terminal windows spread across two computers, and ran the MSAs that way. On average, each one took about 8 hours or so... but because they were all running at the same time, it took me only 8 hours to get the results.
In some ways, I've discovered parallel processing! :D
But I'm now faced with the dilemma of how to get it running in Python. I've tried looking at the following SO posts, but I still cannot seem to wrap my head around how the multiprocessing module works.
List of posts:
Would anybody be kind enough to share how they would parallelize this loop? Many loops I do look similar to this loop, in which I perform some action on a file and write to another file, without ever needing to aggregate the results in memory. The specific difference I am facing is the need to do file I/O, rather than aggregate results from parallel runs of the loop.
Upvotes: 1
Views: 5131
Reputation: 3923
Possibly the Joblib library is what you are looking for.
Let me give you an example of its use:
import time
from joblib import Parallel, delayed

def long_function():
    time.sleep(1)

REPETITIONS = 4

Parallel(n_jobs=REPETITIONS)(
    delayed(long_function)() for _ in range(REPETITIONS))
This code runs in about 1 second instead of 4 seconds, because the four calls sleep at the same time.
Adapting your code looks like this (sorry, I can't test if this is correct):
from joblib import Parallel, delayed
from Bio.Align.Applications import ClustalOmegaCommandline

def run(segment):
    # Align one segment; each call reads and writes its own files.
    in_file = '1.0 - Split FASTA Files/Segment %d.fasta' % segment
    out_file = '1.1 - Aligned FASTA Files/Segment %d Aligned.fasta' % segment
    distmat = '1.1 - Distmats/Segment %d Distmat.fasta' % segment
    cline = ClustalOmegaCommandline(infile=in_file,
                                    outfile=out_file,
                                    distmat_out=distmat,
                                    distmat_full=True,
                                    verbose=True,
                                    force=True)
    print cline
    cline()

if __name__ == "__main__":
    segments = range(1, 9)
    segments.reverse()

    # Start one job per segment so all alignments run at the same time.
    Parallel(n_jobs=len(segments))(
        delayed(run)(segment) for segment in segments)
Upvotes: 3
Reputation: 1046
Instead of for segment in segments, write def f(segment) and then use multiprocessing.Pool().map(f, segments).
Figuring out how to put this in context is left as an exercise to the reader.
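For readers who want a starting point, here is a minimal sketch of that pattern, written for Python 3 (the question's code is Python 2). The worker body is only a stand-in that upper-cases a text file; in the real script it would build the ClustalOmegaCommandline and call cline(). The names segment_paths and run_all, the file names, and the temporary directory are hypothetical, chosen just so the example runs on its own.

```python
import multiprocessing
import os
import tempfile


def segment_paths(base_dir, segment):
    """Return the (input, output) file paths for one segment (hypothetical layout)."""
    in_file = os.path.join(base_dir, 'Segment %d.fasta' % segment)
    out_file = os.path.join(base_dir, 'Segment %d Aligned.fasta' % segment)
    return in_file, out_file


def f(job):
    """Process one segment: read its input file and write its output file."""
    base_dir, segment = job
    in_file, out_file = segment_paths(base_dir, segment)
    with open(in_file) as src, open(out_file, 'w') as dst:
        # Stand-in for the real work. In the question this is where the
        # ClustalOmegaCommandline would be built and cline() called.
        dst.write(src.read().upper())
    return out_file


def run_all(base_dir, segments, processes=None):
    """Map f over all segments using a pool of worker processes."""
    jobs = [(base_dir, segment) for segment in segments]
    with multiprocessing.Pool(processes) as pool:
        # map() blocks until every job has finished and keeps input order.
        return pool.map(f, jobs)


if __name__ == '__main__':
    # Create small dummy input files so the sketch is self-contained.
    base_dir = tempfile.mkdtemp()
    segments = list(range(8, 0, -1))
    for segment in segments:
        with open(segment_paths(base_dir, segment)[0], 'w') as handle:
            handle.write('>seq%d\nacgt\n' % segment)

    for path in run_all(base_dir, segments, processes=4):
        print(path)
```

The worker must be a module-level function so it can be sent to the worker processes, and the `if __name__ == '__main__'` guard keeps those processes from re-running the setup code on platforms that start them by re-importing the script. The `processes` argument caps how many jobs run at once; left as None, Pool picks a default based on the number of CPUs, which may be worth tuning given that each alignment in the question slowed down when all 8 ran together.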
Upvotes: 3