Reputation: 897
There is a Stack Overflow post that nicely shows a way to calculate the proximity matrix of a RandomForestClassifier():
Proximity Matrix in sklearn.ensemble.RandomForestClassifier
Nevertheless, the for-loop in that script is quite slow if you have a large dataframe. I tried to parallelize this for-loop, but without success: I only get 'None' as output.
How can I parallelize this for-loop in Spyder 4 running Python 3.8.5 on Windows 10?
a = terminals[:, 0]
proxMat = 1*np.equal.outer(a, a)
for i in range(1, nTrees):
    a = terminals[:, i]
    proxMat += 1*np.equal.outer(a, a)
Upvotes: 2
Views: 2617
Reputation: 54
Here you want to perform a reduce operation, so parallelization is not obvious. You did not specify how you tried to parallelize the loop. A simple way to parallelize it:
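import multiprocessing
import numpy as np

# terminals and nTrees must exist at module level so the worker processes can see them
def get_outer(i):
    # True where two samples land in the same leaf of tree i
    return np.equal.outer(terminals[:, i], terminals[:, i])

if __name__ == "__main__":  # needed on Windows, where workers re-import this module
    with multiprocessing.Pool(processes=4) as pool:
        results = pool.map(get_outer, range(1, nTrees))
    # Start from tree 0, then add the matrices for the remaining trees
    proxMat = 1 * np.equal.outer(terminals[:, 0], terminals[:, 0])
    for res in results:
        proxMat += res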
I'm not sure this variant would help, but you might have fewer pickling problems:
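import multiprocessing
import numpy as np

def get_outer(t):
    # t is one column of leaf indices, so the worker needs no global state
    return np.equal.outer(t, t)

if __name__ == "__main__":  # needed on Windows
    # This part might be costly!
    terms = [terminals[:, i] for i in range(1, nTrees)]
    with multiprocessing.Pool(processes=4) as pool:
        results = pool.map(get_outer, terms)
    proxMat = 1 * np.equal.outer(terminals[:, 0], terminals[:, 0])
    for res in results:
        proxMat += res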
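Since the reduction is just a sum, and addition is associative, each worker could also add up its own chunk of trees and send back a single matrix instead of one matrix per tree. Below is a minimal sketch of that idea, not a tuned implementation. The helper names (partial_proximity, serial_proximity, parallel_proximity) are my own for illustration, and terminals is assumed to be the (n_samples, nTrees) array of leaf indices from the question.

```python
import multiprocessing

import numpy as np


def partial_proximity(block):
    """Sum the leaf co-occurrence matrices over the trees (columns) in block."""
    n_samples = block.shape[0]
    total = np.zeros((n_samples, n_samples), dtype=np.int64)
    for col in block.T:  # one column of leaf indices per tree
        total += np.equal.outer(col, col)
    return total


def serial_proximity(terminals):
    """Reference result: the plain loop over all trees."""
    return partial_proximity(terminals)


def parallel_proximity(terminals, processes=4):
    """Split the trees into chunks, sum each chunk in a worker, add the partial sums."""
    # Hypothetical chunking choice: one chunk of columns per worker process
    chunks = np.array_split(terminals, processes, axis=1)
    with multiprocessing.Pool(processes=processes) as pool:
        partials = pool.map(partial_proximity, chunks)
    return sum(partials)
```

On Windows, call parallel_proximity(terminals) from inside an if __name__ == "__main__": block in a script, since the worker processes re-import the main module. Whether this is actually faster than the serial loop depends on how large terminals is, because each chunk still has to be pickled and sent to a worker.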
Upvotes: 1