JonnDough

Reputation: 897

Parallelization with multiprocessing, joblib or multiprocess is not working

This Stack Overflow post shows a nice way to calculate the proximity matrix of a RandomForestClassifier():

Proximity Matrix in sklearn.ensemble.RandomForestClassifier

Nevertheless, the for-loop in that script is quite slow if you have a large dataframe. I tried to parallelize this for-loop, but without success: I only get 'None' as an output.

How can I parallelize this for-loop in Spyder 4 running Python 3.8.5 on Windows 10?

a = terminals[:, 0]  # column for the first tree; the loop below starts at tree 1
proxMat = 1*np.equal.outer(a, a)

for i in range(1, nTrees):
    a = terminals[:, i]
    proxMat += 1*np.equal.outer(a, a)

Upvotes: 2

Views: 2617

Answers (1)

SergeD

Reputation: 54

Here you want to perform a reduce operation, so parallelization is not obvious. You did not specify how you tried to parallelize the loop. A simple way to parallelize:

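import multiprocessing
import numpy as np

def get_outer(i):
    # terminals is looked up as a global inside each worker process
    return np.equal.outer(terminals[:, i], terminals[:, i])

if __name__ == "__main__":  # guard needed on Windows, where workers are spawned
    todo = list(range(1, nTrees))
    # create the pool only after get_outer is defined
    with multiprocessing.Pool(processes=4) as pool:
        results = pool.map(get_outer, todo)
    a = terminals[:, 0]
    proxMat = 1*np.equal.outer(a, a)
    for res in results:
        proxMat += res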
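
Since this is a reduction, another option is to have each worker sum the outer products for a block of trees and send back a single matrix, so that fewer large arrays travel between processes. A minimal sketch, not something tested on your data: `partial_proximity` and `proximity_chunked` are hypothetical helper names, the `terminals` array is made up, and `map_fn` defaults to the built-in serial `map` so the snippet runs anywhere.

```python
import numpy as np

def partial_proximity(block):
    """Sum the equality outer products over the columns (trees) of one block."""
    n_samples = block.shape[0]
    out = np.zeros((n_samples, n_samples), dtype=np.int64)
    for col in block.T:  # one column per tree
        out += np.equal.outer(col, col)
    return out

def proximity_chunked(terminals, n_chunks=4, map_fn=map):
    """Split the trees into blocks, reduce each block, then add the partial sums."""
    blocks = np.array_split(terminals, n_chunks, axis=1)
    return sum(map_fn(partial_proximity, blocks))

# Made-up leaf indices: 5 samples, 6 trees (a stand-in for the real `terminals`).
rng = np.random.default_rng(0)
terminals = rng.integers(0, 3, size=(5, 6))
prox = proximity_chunked(terminals)
```

To run the blocks in parallel you could pass `pool.map` from a `multiprocessing.Pool` as `map_fn`; on Windows that should happen in a script under an `if __name__ == "__main__":` guard. This version includes tree 0 in the blocks, so there is no separate initialisation of the matrix.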

I'm not sure this one would help, but possibly you'd have fewer pickling problems:

import multiprocessing
import numpy as np

def get_outer(t):
    return np.equal.outer(t, t)

if __name__ == "__main__":  # guard needed on Windows, where workers are spawned
    # This part might be costly!
    terms = [terminals[:, i] for i in range(1, nTrees)]

    with multiprocessing.Pool(processes=4) as pool:
        results = pool.map(get_outer, terms)
    a = terminals[:, 0]
    proxMat = 1*np.equal.outer(a, a)
    for res in results:
        proxMat += res
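
If pickling or the interactive console turns out to be the obstacle, threads may be worth a try instead of processes: they share memory, so nothing is pickled and no `__main__` guard is required. Whether this is actually faster depends on how much of the work NumPy does with the GIL released, so treat it as a sketch to time on your own data. The `terminals` array below is made up and `outer_equal` is a hypothetical helper name.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def outer_equal(col):
    """Boolean matrix: True where two samples land in the same leaf of one tree."""
    return np.equal.outer(col, col)

# Made-up leaf indices: 8 samples, 10 trees (a stand-in for the real `terminals`).
rng = np.random.default_rng(1)
terminals = rng.integers(0, 4, size=(8, 10))
n_samples, n_trees = terminals.shape

prox_threads = np.zeros((n_samples, n_samples), dtype=np.int64)
with ThreadPoolExecutor(max_workers=4) as executor:
    # terminals.T yields one column (tree) at a time; results come back in order
    for res in executor.map(outer_equal, terminals.T):
        prox_threads += res
```

Adding each result as it arrives keeps only a few of the n-by-n matrices in memory at once, instead of a full list of them.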

Upvotes: 1
