python, how to reference a data_frame row to execute in parallel

Question

I have written a simple for loop that iterates per rows of a data frame. Each row of a data frame is compared with all entries of the second data frame (second input of that function).

Now this function after some comparisons and searching returns back rows of a new dataframe that has the following structure.

new_df=pd.DataFrame(columns=['1','2','3','4','5','6','dist','unique','occurence','timediff','id'], dtype='float')

The for loop now looks like this:

for i in range(0,small_pd.shape[0]):
    new_df=new_df.append(SequencesExtractTime(small_pd.loc[i],large_pd.loc[i]) )

I am trying to find a way to run this code in parallel since it take years to execute on a single core.

I have found the joblib package

from joblib import Parallel, delayed
import multiprocessing

num_cores = multiprocessing.cpu_count()
print(Parallel(n_jobs=num_cores)(SequencesExtractTime(small_pd,large_pd)(i) for i in range(0,small_pd.shape[0])))

The problem now is how to use the two data frames properly so can be used from the parallel loop. I think the problem is that I do not know how to write the input arguments from the form of I had them in the for loop

small_pd.loc[i]

in a form for the Parallel function.

Can you please help me with this problem? Thanks Alex

python, how to reference a data_frame row to execute in parallel

Answers (1)

Related Questions