Reputation: 27
I don't know how to parallelise Python code that takes each sequence of a FASTA file and computes some statistics on it, such as GC content. Do you have any tips or libraries that would help me decrease the execution time?
I've tried to use os.fork(), but it gives me a longer execution time than the sequential code, probably because I don't know how to give each child a different sequence.
# Computing GC content
from Bio import SeqIO

with open('chr1.fa', 'r') as f:
    records = list(SeqIO.parse(f, 'fasta'))

GC_for_sequence = []
for i in records:
    GC = 0
    for j in i:
        if j in "GC":
            GC += 1
    GC_for_sequence.append(GC / len(i))

print(GC_for_sequence)
The expected behaviour is that each process takes one sequence and they compute the statistics in parallel.
Upvotes: 2
Views: 946
Reputation: 16204
A few notes on your existing code to start with:

I'd suggest not doing list(SeqIO.parse(…)), as that will pause execution until all sequences have been loaded into memory; you're much better off (for memory and total execution time) leaving it as an iterator and consuming elements off to workers as needed.

Looping over each character is pretty slow; using str.count is going to be much faster.
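For example, on a toy sequence string (made up here purely for illustration), the character counting collapses to two C-level calls:

# str.count scans the string in C, so this beats a per-character Python loop
seq = "ACTGTCGCAGC"
gc = seq.count("G") + seq.count("C")
print(gc / len(seq))  # GC proportion of the toy string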
Putting this together, you can do:
from Bio import SeqIO

with open('chr1.fa') as fd:
    gc_for_sequence = []
    for seq in SeqIO.parse(fd, 'fasta'):
        gc = sum(seq.seq.count(base) for base in "GC")
        gc_for_sequence.append(gc / len(seq))
If this still isn't fast enough, you can use the multiprocessing module like this:
from Bio import SeqIO
from multiprocessing import Pool

def sequence_gc_prop(seq):
    return sum(seq.count(base) for base in "GC") / len(seq)

with open('chr1.fa') as fd, Pool() as pool:
    gc_for_sequence = pool.map(
        sequence_gc_prop,
        (seq.seq for seq in SeqIO.parse(fd, 'fasta')),
        chunksize=1000,
    )
Comments from Lukasz mostly apply. Other non-obvious stuff:

The seq.seq for seq in … part is there to make sure that we're not pickling unnecessary data.

chunksize is set to quite a large value because the function should be quick; we want to give the children a reasonable amount of work each, so the parent process doesn't spend all its time orchestrating things.

Upvotes: 1
Reputation: 11397
Here's one idea using the standard multiprocessing module:
from multiprocessing import Pool
import numpy as np

no_cores_to_use = 4

GC_for_sequence = [np.random.rand(100) for x in range(10)]

with Pool(no_cores_to_use) as pool:
    result = pool.map(np.average, GC_for_sequence)

print(result)
In the code I used the numpy module to simulate a list with some content. pool.map takes the function you want to apply to your data as its first argument and the data list as its second. You can easily define the function yourself; by default it should take a single argument. If you want to pass more, use functools.partial.
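For instance, here's a minimal sketch of that (the helper name and the threshold argument are made up for illustration):

from functools import partial
from multiprocessing import Pool

# hypothetical helper taking an extra argument besides the sequence itself
def gc_above(sequence, threshold):
    gc = sequence.count('G') + sequence.count('C')
    return gc / len(sequence) > threshold

records = ['ACTGTCGCAGC' for x in range(10)]

if __name__ == '__main__':  # guard needed on platforms that spawn rather than fork
    with Pool(4) as pool:
        # partial pins threshold, so pool.map still calls a one-argument function
        result = pool.map(partial(gc_above, threshold=0.5), records)
    print(result)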
[EDIT] Here's an example much closer to your problem:
from multiprocessing import Pool

records = ['ACTGTCGCAGC' for x in range(10)]
no_cores_to_use = 4

def count(sequence):
    # count G and C separately: counting the substring 'GC' would only
    # catch a G immediately followed by a C
    gc = sequence.count('G') + sequence.count('C')
    return gc

with Pool(no_cores_to_use) as pool:
    result = pool.map(count, records)

print(sum(result))
Upvotes: 0