Is there any downside in using multiple "n_jobs=-1" statements?

Question

In the context of model selection for a classification problem, while running cross-validation, is it OK to specify n_jobs=-1 both in the model specification and in the cross-validation function, in order to take full advantage of the machine?

For example, comparing sklearn RandomForestClassifier and xgboost XGBClassifier:

RF_model = RandomForestClassifier( ..., n_jobs=-1)
XGB_model = XGBClassifier( ..., n_jobs=-1)

RF_cv = cross_validate(RF_model, ..., n_jobs=-1)
XGB_cv = cross_validate(XGB_model, ..., n_jobs=-1)

Is it OK to specify the parameter in both? Or should I specify it only once? If so, in which one: the model or the cross-validation call?

I used models from two different libraries (sklearn and xgboost) in the example because there may be a difference in how they work. Note that the cross_validate function is from sklearn.

Nick ODell · Accepted Answer

Specifying n_jobs twice does have an effect, though whether it has a positive or negative effect is complicated.

When you specify n_jobs twice, you get two levels of parallelism. Imagine you have N cores. The cross-validation function runs up to N copies of your model at once. Each model creates N threads to run fitting and predictions. You then have up to N*N threads.
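The arithmetic above can be sketched as a small helper. This is only an upper-bound estimate under stated assumptions, not a measurement: the function name worst_case_threads and its parameters are made up for illustration, it assumes the outer level keeps at most min(cores, tasks) workers busy, and it ignores any limits the underlying libraries may place on nested parallelism.

```python
def worst_case_threads(n_cores, n_outer_tasks, outer_parallel, inner_parallel):
    """Rough upper bound on simultaneously busy threads (hypothetical helper).

    Assumes n_jobs=-1 means "one worker per core" at each level, and that the
    outer level can only keep as many workers busy as there are tasks.
    """
    # Outer level: one worker per core, capped by the number of tasks (e.g. CV folds).
    outer = min(n_cores, n_outer_tasks) if outer_parallel else 1
    # Inner level: each busy worker starts one thread per core.
    inner = n_cores if inner_parallel else 1
    return outer * inner


# 4 cores, 5 CV folds, both levels parallel
print(worst_case_threads(4, 5, True, True))       # 16
# 12 cores, thousands of independent tasks, both levels parallel
print(worst_case_threads(12, 10_000, True, True))  # 144
# Same machine with the inner level turned off
print(worst_case_threads(12, 10_000, True, False))  # 12
```

Whether that many threads actually hurts depends on the workload, which is why measuring is more reliable than counting.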

This can blow up pretty spectacularly. I once worked on a program which needed to apply ARIMA to tens of thousands of time-series. Since each ARIMA is independent, I parallelized it and ran one ARIMA on each core of a 12-core CPU. I ran this, and it performed very poorly. I opened up htop, and was surprised to find 144 threads running. It turned out that this library, pmdarima, internally parallelized ARIMA operations. (It doesn't parallelize them well, but it does try.) I got a massive speedup just by turning off this inner layer of parallelism. Having two levels of parallelism is not necessarily better than having one.

In your specific case, I benchmarked a random forest with cross-validation in four configurations:

No parallelism
Parallelize across different CV folds, but no model parallelism
Parallelize within the model, but not on CV folds
Do both

(Error bars represent the 95% confidence interval. All tests used RandomForestClassifier. Tests were performed using cv=5, 100K samples, and 100 trees. The test system had 4 cores with SMT disabled. Scores are the mean duration of 7 runs.)

This graph shows that no parallelism is the slowest, CV parallelism is third fastest, and model parallelism and combined parallelism are tied for first place.

However, this is closely tied to what classifiers I'm using - a benchmark for pmdarima, for example, would find that cross-val parallelism is faster than model parallelism or combined parallelism. If you don't know which one is faster, then test it.
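A minimal sketch of such a test, timing the four configurations listed earlier, might look like the following. It is not the script behind the benchmark above: the synthetic data from make_classification, the feature count, the seed, and the helper name time_configs are all assumptions for illustration, and it times a single run per configuration rather than averaging several.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# (label, n_jobs for the model, n_jobs for cross_validate)
CONFIGS = [
    ("no parallelism", 1, 1),
    ("CV folds only", 1, -1),
    ("model only", -1, 1),
    ("both", -1, -1),
]


def time_configs(n_samples=100_000, n_trees=100, cv=5, seed=0):
    """Return {label: seconds} for one cross-validation run per configuration."""
    # Synthetic data is an assumption; substitute your own X and y.
    X, y = make_classification(n_samples=n_samples, n_features=20, random_state=seed)
    timings = {}
    for label, model_jobs, cv_jobs in CONFIGS:
        model = RandomForestClassifier(
            n_estimators=n_trees, n_jobs=model_jobs, random_state=seed
        )
        start = time.perf_counter()
        cross_validate(model, X, y, cv=cv, n_jobs=cv_jobs)
        timings[label] = time.perf_counter() - start
    return timings


if __name__ == "__main__":
    for label, seconds in time_configs().items():
        print(f"{label:>15}: {seconds:.2f} s")
```

A single run is noisy, and the first parallel run can include worker start-up time, so repeating each configuration several times and comparing the means gives a fairer picture. Swapping in XGBClassifier or another estimator only requires changing the line that builds the model.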
