Alex P

Reputation: 137

python, how to reference a data_frame row to execute in parallel

I have written a simple for loop that iterates over the rows of a data frame. Each row is compared with all entries of a second data frame (the second input of that function).

After some comparisons and searching, this function returns rows of a new data frame with the following structure:

new_df=pd.DataFrame(columns=['1','2','3','4','5','6','dist','unique','occurence','timediff','id'], dtype='float')

The for loop now looks like this:

for i in range(0,small_pd.shape[0]):
    new_df=new_df.append(SequencesExtractTime(small_pd.loc[i],large_pd.loc[i]) )

I am trying to find a way to run this code in parallel, since it takes ages to execute on a single core.

I have found the joblib package:

from joblib import Parallel, delayed
import multiprocessing

num_cores = multiprocessing.cpu_count()
print(Parallel(n_jobs=num_cores)(SequencesExtractTime(small_pd,large_pd)(i) for i in range(0,small_pd.shape[0])))

The problem now is how to pass the two data frames properly so they can be used from the parallel loop. I think the issue is that I do not know how to convert the input arguments from the form I had in the for loop,

small_pd.loc[i]

into a form the Parallel function accepts.
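For context, my understanding is that the call should wrap the function with delayed and pass each row as an argument, roughly like the sketch below (SequencesExtractTime is replaced by a dummy stand-in here, since its real body is not shown):

```python
import multiprocessing

import pandas as pd
from joblib import Parallel, delayed

# Dummy stand-in for SequencesExtractTime: takes one row of small_pd and
# the whole second frame, and returns a one-row DataFrame.
def sequences_extract_time(row, other_df):
    return pd.DataFrame({'dist': [row['a'] + other_df['a'].sum()]})

small_pd = pd.DataFrame({'a': [1, 2, 3]})
large_pd = pd.DataFrame({'a': [10, 20, 30]})

num_cores = multiprocessing.cpu_count()

# delayed(f)(args) builds a lazy call; Parallel executes the calls
# across the worker processes and returns the list of results.
results = Parallel(n_jobs=num_cores)(
    delayed(sequences_extract_time)(small_pd.loc[i], large_pd)
    for i in range(small_pd.shape[0])
)

# Concatenate the per-row results once at the end, instead of
# appending inside the loop.
new_df = pd.concat(results, ignore_index=True)
```
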

Can you please help me with this problem? Thanks Alex

Upvotes: 0

Views: 52

Answers (1)

Joe

Reputation: 889

Do your DataFrames contain more than 1M rows? If so, crude loops, even run in parallel, will take a toll on memory.

If you really need to compare each entry of the 1st df to the 2nd df, consider converting the columns to lists or sets instead.

That way you can take advantage of .intersection() or .difference(), whichever suits your filtering needs (see the set documentation).
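A minimal sketch of that idea, assuming both frames have a comparable id column (the column names here are illustrative, not taken from the question):

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3, 4]})
df2 = pd.DataFrame({'id': [3, 4, 5]})

# Set operations run in C and avoid a Python-level row-by-row loop.
common = set(df1['id']).intersection(df2['id'])    # ids present in both
only_in_df1 = set(df1['id']).difference(df2['id'])  # ids only in df1

# Pull the matching rows back out of the frame in one vectorised step.
matched_rows = df1[df1['id'].isin(common)]
```
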

Or try groupby() from pandas.
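For example, if the per-row work amounts to aggregating by some key, a single groupby() replaces the whole loop (again with illustrative column names, since the real ones are not shown):

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 2, 2, 2],
                   'timediff': [0.5, 1.5, 0.2, 0.4, 0.6]})

# One vectorised pass: count and mean of timediff per id.
per_id = df.groupby('id')['timediff'].agg(['count', 'mean'])
```
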

Upvotes: 1
