Reputation: 3642
I'm doing a thesis on model assessment techniques for machine learning classification tasks. I'm using some sklearn models because I can write generic code for the most part, as I have lots of different datasets. One part of sklearn's model output is predict_proba, which returns probability estimates. For large datasets with lots of datapoints, computing predict_proba for each datapoint takes a long time. I loaded up htop and saw Python using only a single core for the calculations, so I wrote the following function:
from joblib import Parallel, delayed
import multiprocessing

num_cores = multiprocessing.cpu_count()

def makeprob(r, first, p2, firstm):
    # score a single row: reshape to (1, p2) so predict_proba accepts it
    reshaped_r = first[r].reshape(1, p2)
    probo = clf.predict_proba(reshaped_r)
    probo = probo.max()
    print('Currently at %(perc)s percent' % {'perc': (r / firstm) * 100})
    return probo

# using multiple cores to run the function 'makeprob'
results = Parallel(n_jobs=num_cores)(delayed(makeprob)(r, first, p2, firstm) for r in range(firstm))
Now I see with htop that all cores are being used, and the speed-up is significant, but not nearly as fast as I would like. If anybody knows a way to speed this up, or can point me in the right direction toward faster computation in this scenario, that would be great.
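For what it's worth, one direction I'm aware of is batching: sklearn's predict_proba accepts a whole 2-D array in a single call, so (assuming clf and first are the fitted classifier and the (firstm, p2) feature array from the snippet above) one vectorised call could replace the per-row loop. A minimal sketch:

# Minimal sketch: one batched predict_proba call instead of one call per row.
# Assumes clf is the fitted sklearn classifier and first is the
# (firstm, p2) feature array from the snippet above.
all_probs = clf.predict_proba(first)   # shape (firstm, n_classes)
results = all_probs.max(axis=1)        # max class probability per datapoint

This would avoid the per-row Python call and inter-process pickling overhead entirely, though I'm not sure how it behaves memory-wise on my larger datasets.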
Upvotes: 2
Views: 411
Reputation: 6100
The loss of performance depends on three elements:
Upvotes: 1