Reputation: 5907
I have a scikit-learn model and a huge test dataset to predict on. To speed up the prediction I want to use multiprocessing, but I am really unable to crack it and need help in this regard.
import pandas as pd
from sklearn.externals import joblib

dataset = pd.read_csv('testdata.csv')  # 8mln rows
feature_cols = ['col1', 'col2', 'col3']

# load model
model = joblib.load(model_saved_path)  # random-forest classifier

# predict function
def predict_func(model, data, feature_cols):
    return model.predict(data[feature_cols])

# normal execution
predict_vals = predict_func(model, dataset, feature_cols)  # 130 secs
Now I want to use multiprocessing to predict: chunk the dataset, run the predict function on each chunk separately on each core, then join the results back together. But I am not able to do so.
I have tried:
import multiprocessing as mp

def mp_handler():
    p = mp.Pool(3)                  # I think this starts 3 processes
    p.map(predict_func, testData)   # how do I pass the other parameters?

mp_handler()
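Roughly, I imagine something like the sketch below (untested; functools.partial to bind the fixed arguments and numpy.array_split to chunk the frame are just my guesses):

import numpy as np
import multiprocessing as mp
from functools import partial

def predict_chunk(model, feature_cols, data):
    # each worker predicts on its own chunk of rows
    return model.predict(data[feature_cols])

def mp_handler(model, dataset, feature_cols, n_procs=3):
    chunks = np.array_split(dataset, n_procs)           # split the frame into n_procs pieces
    func = partial(predict_chunk, model, feature_cols)  # bind model and columns; chunk stays free
    with mp.Pool(n_procs) as pool:
        results = pool.map(func, chunks)                # one chunk per worker process
    return np.concatenate(results)                      # stitch the per-chunk predictions back

but I don't know if passing the model around to workers like this is sane.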
I have no idea if this is the way to do multiprocessing in Python (forgive my ignorance here). I have read a few search results and came up with this.
If somebody can help with the coding, that would be a great help, or a link to read up on multiprocessing would be fair enough. Thanks.
Upvotes: 1
Views: 5267
Reputation: 33532
You used a RandomForest (which I would have guessed from the slow prediction).
The takeaway message here is: it's already parallelized (at the ensemble level!), and all your attempts to parallelize at the outer level will slow things down!
It's somewhat arbitrary how I interpret these levels, but what I mean is: the outer level is your own dataset-chunking with multiprocessing; the ensemble level is sklearn predicting over the trees of the forest in parallel.
The general rule is: a forest trained with n_jobs=-1 (parallel; not the default!) will predict using min(number of cores, n_estimators) cores!
So you should use the right n_jobs argument at training time to get parallelization. sklearn will use it as explained, and this can be seen in the source code.
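For illustration, a minimal sketch of setting this at training time (X_train and y_train are placeholders for your training data):

from sklearn.ensemble import RandomForestClassifier

# n_jobs=-1: fit and predict in parallel across the trees (ensemble-level)
model = RandomForestClassifier(n_estimators=100, n_jobs=-1)
model.fit(X_train, y_train)                           # placeholder training data
predict_vals = model.predict(dataset[feature_cols])   # uses min(cores, n_estimators) cores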
If you already trained your classifier with n_jobs=1 (not parallel), things get more difficult. It might work out to do:
# untested
model = joblib.load(model_saved_path)
#model.n_jobs = -1 # unclear if -1 is substituted earlier
model.n_jobs = 4 # more explicit usage
Keep in mind that using n_jobs > 1 uses more memory!
Take your favorite OS monitor, make sure you set up your classifier correctly (parallel -> n_jobs), and observe the CPU usage during raw prediction. This is not for evaluating the effect of parallelization, but as an indication that it is using parallelization at all!
If you still need parallelization, e.g. when you have 32 cores and use n_estimators=10, then use joblib, the multiprocessing wrapper by the sklearn people that is used a lot within sklearn. The basic examples should be ready to use!
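An untested sketch of that outer-level chunking with joblib (np.array_split for the chunking is my choice; the memory caveat from above applies per worker, since each one gets a copy of the model):

import numpy as np
from joblib import Parallel, delayed

def predict_chunk(model, chunk, feature_cols):
    return model.predict(chunk[feature_cols])

chunks = np.array_split(dataset, 32)    # e.g. one chunk per core
results = Parallel(n_jobs=32)(
    delayed(predict_chunk)(model, chunk, feature_cols) for chunk in chunks)
predict_vals = np.concatenate(results)  # joblib preserves input order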
Whether this will speed things up depends on many, many things (IO and co).
Upvotes: 3