Reputation: 69
In the context of model selection for a classification problem, while running cross validation, is it ok to specify n_jobs=-1
both in model specification and cross validation function in order to take full advantage of the power of the machine?
For example, comparing sklearn RandomForestClassifier and xgboost XGBClassifier:
RF_model = RandomForestClassifier( ..., n_jobs=-1)
XGB_model = XGBClassifier( ..., n_jobs=-1)
RF_cv = cross_validate(RF_model, ..., n_jobs=-1)
XGB_cv = cross_validate(XGB_model, ..., n_jobs=-1)
is it ok to specify the parameters in both? Or should I specify it only once? And in which of them, model or cross validation statement?
I used for the example models from two different libraries (sklearn and xgboost) because maybe there is a difference in how it works, also cross_validate
function is from sklearn.
Upvotes: 3
Views: 1106
Reputation: 25259
Specifying n_jobs
twice does have an effect, though whether it has a positive or negative effect is complicated.
When you specify n_jobs
twice, you get two levels of parallelism. Imagine you have N cores. The cross-validation function creates N copies of your model. Each model creates N threads to run fitting and predictions. You then have N*N threads.
This can blow up pretty spectacularly. I once worked on a program which needed to apply ARIMA to tens of thousands of time-series. Since each ARIMA is independent, I parallelized it and ran one ARIMA on each core of a 12-core CPU. I ran this, and it performed very poorly. I opened up htop
, and was surprised to find 144 threads running. It turned out that this library, pmdarima, internally parallelized ARIMA operations. (It doesn't parallelize them well, but it does try.) I got a massive speedup just by turning off this inner layer of parallelism. Having two levels of parallelism is not necessarily better than having one.
In your specific case, I benchmarked a random forest with cross validation, and I benchmarked four configurations:
(Error bars represent 95% confidence interval. All tests used RandomForestClassifier. Test was performed using cv=5, 100K samples, and 100 trees. Test system had 4 cores with SMT disabled. Scores are mean duration of 7 runs.)
This graph shows that no parallelism is the slowest, CV parallelism is third fastest, and model parallelism and combined parallelism are tied for first place.
However, this is closely tied to what classifiers I'm using - a benchmark for pmdarima, for example, would find that cross-val parallelism is faster than model parallelism or combined parallelism. If you don't know which one is faster, then test it.
Upvotes: 4