Satya

Reputation: 5907

Running sk-learn model.predict with python multiprocessing

I have a model created with scikit-learn and a huge test dataset to predict on. To speed up the prediction I want to use multiprocessing, but I am really unable to crack it and need help in this regard.

import pandas as pd
from sklearn.externals import joblib
dataset = pd.read_csv('testdata.csv')  # 8mln rows
feature_cols = ['col1', 'col2', 'col3']

#load model
model = joblib.load(model_saved_path)                # random-forest classifier

#predict Function
def predict_func(model, data, feature_cols):
    return model.predict(data[feature_cols])

#Normal Execution
predict_vals = predict_func(model, dataset, feature_cols) #130 secs

Now I want to use multiprocessing for the prediction (chunk the dataset, run the predict function on each chunk on a separate core, then join the results back together).

But I am not able to do so.

I have tried

import multiprocessing as mp

def mp_handler():
    p = mp.Pool(3)                 # I think this starts 3 processes
    p.map(predict_func, testData)  # How to pass the other parameters?

mp_handler()

I have no idea if this is the way to do multiprocessing in Python (forgive my ignorance here). I read a few search results and came up with this.

If somebody can help with the code, that would be a great help, or a link to read up on multiprocessing would be fair enough. Thanks.

Upvotes: 1

Views: 5267

Answers (1)

sascha

Reputation: 33532

You used a RandomForest (which I would have guessed from the slow prediction).

The takeaway message here is: it is already parallelized (at the ensemble level!), and all your attempts to parallelize it at the outer level will slow things down!

It's somewhat arbitrary how I interpret these levels, but what I mean is:

  • lowest-level: the core-algorithm is parallel
    • Decision-tree is the core of RF; not parallel (in sklearn)!
    • affects single-prediction performance
  • medium-level: the ensemble-algorithm is parallel
    • RF = multiple Decision-trees: parallel (in sklearn)!
    • affects single-prediction performance
  • high-level: the batch-prediction is parallel
    • This is what you want to do and only makes sense if the lower levels do not exploit your capacities already!
    • does not affect single-prediction performance (as you know already)

The general rule is:

  • if using the correct arguments (e.g. n_jobs=-1; not default!):
    • RF will use min(number of cores, n_estimators) cores!
      • Speedup can only be achieved if the above is lower than your number of cores!

So you should use the right n_jobs argument at training time to make use of parallelization. sklearn will use it as explained, and it can be seen here.
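
For illustration, a minimal sketch of training with parallelization enabled from the start (the n_estimators value and the X_train/y_train names are placeholders, not taken from the question):

from sklearn.ensemble import RandomForestClassifier

# illustrative settings only
model = RandomForestClassifier(n_estimators=100, n_jobs=-1)  # n_jobs=-1: use all cores
model.fit(X_train, y_train)                                  # X_train/y_train: your training data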

If you already trained your classifier with n_jobs=1 (not parallel), things get more difficult. It might work out to do:

# untested
model = joblib.load(model_saved_path)
#model.n_jobs = -1                     # unclear if -1 is substituted earlier
model.n_jobs = 4                       # more explicit usage

Keep in mind that using n_jobs > 1 uses more memory!

Take your favorite OS monitor, make sure you set up your classifier correctly (parallel -> n_jobs) and observe the CPU usage during raw prediction. This is not for evaluating the effect of parallelization, but for some indication that it is using parallelization!

If you still need parallelization, e.g. when you have 32 cores and use n_estimators=10, then use joblib, the multiprocessing wrapper by the sklearn people that is used a lot within sklearn. The basic examples should be ready to use! A minimal, untested sketch of that outer-level approach is below; it reuses predict_func from the question, and the chunk count of 4 is an arbitrary choice.
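
# untested sketch of outer-level (batch) parallel prediction
import numpy as np
from joblib import Parallel, delayed

chunks = np.array_split(dataset, 4)            # split the big DataFrame into 4 pieces
results = Parallel(n_jobs=4)(
    delayed(predict_func)(model, chunk, feature_cols) for chunk in chunks
)
predict_vals = np.concatenate(results)         # join the per-chunk predictions back together

Note that each worker gets its own copy of the model and its chunk, so memory usage grows with n_jobs.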

Whether this speeds things up will depend on many, many things (IO and co).

Upvotes: 3
