Parallelising Python code

Question

I have written a function that returns a Pandas data frame (sample as a row and descriptor as columns) and takes input as a list of peptides (a biological sequence as strings data). "my_function(pep_list)" takes pep_list as a parameter and return data frame. it iterates eache peptide sequence from pep_list and calculates descriptor and combined all the data as pandas data frame and returns df:

pep_list = [DAAAAEF,DAAAREF,DAAANEF,DAAADEF,DAAACEF,DAAAEEF,DAAAQEF,DAAAGEF,DAAAHEF,DAAAIEF,DAAALEF,DAAAKEF]

example:

I want to parallelising this code with the given algorithm bellow:

1. get the number of processor available as .
    n = multiprocessing.cpu_count()

2. split the pep_list  as 
     sub_list_of_pep_list = pep_list/n 

     sub_list_of_pep_list = [[DAAAAEF,DAAAREF,DAAANEF],[DAAADEF,DAAACEF,DAAAEEF],[DAAAQEF,DAAAGEF,DAAAHEF],[DAAAIEF,DAAALEF,DAAAKEF]]

4. run "my_function()" for each core as (example if 4 cores )

     df0 = my_function(sub_list_of_pep_list[0])
     df1 = my_function(sub_list_of_pep_list[1])
     df2 = my_functonn(sub_list_of_pep_list[2])
     df3 = my_functonn(sub_list_of_pep_list[4])

5. join all df = concat[df0,df1,df2,df3] 

6. returns df with nX speed.

Please suggest me the best suitable library to implement this method.

thanks and regards.

Updated

With some reading i am able to write down a code which work as per my expectation like 1. without parallelising it takes ~10 second for 10 peptide sequence 2. with two processes it takes ~6 second for 12 peptide 3. with four processes it takes ~4 second for 12 peptides

from multiprocessing import Process

def func1():
    structure_gen(pep_seq = ["DAAAAEF","DAAAREF","DAAANEF"])

def func2():
    structure_gen(pep_seq = ["DAAAQEF","DAAAGEF","DAAAHEF"])


def func3():
    structure_gen(pep_seq = ["DAAADEF","DAAALEF"])

def func4():
    structure_gen(pep_seq = ["DAAAIEF","DAAALEF"])

if __name__ == '__main__':
  p1 = Process(target=func1)
  p1.start()
  p2 = Process(target=func2)
  p2.start()
  p3 = Process(target=func1)
  p3.start()
  p4 = Process(target=func2)
  p4.start()
  p1.join()
  p2.join()
  p3.join()
  p4.join()

but this code easily work with 10 peptide but not able to implement it for a PEP_list contains 1 million peptide

thanks

Parallelising Python code

Answers (1)

Related Questions