Reputation: 1307
I want to parallelize my model-building procedure using scikit-learn. I wonder whether it makes sense to parallelize both the outer and the inner loop (i.e. setting n_jobs = -1 both for GridSearchCV and for cross_validate)?
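In other words, a setup roughly like this minimal sketch (data, estimator and parameter grid are purely illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_validate

# purely illustrative data and estimator
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# inner loop: hyperparameter search, asked to use every core
inner = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 10]},
    cv=3,
    n_jobs=-1,
)

# outer loop: cross-validation of the whole search, also asked to use every core
outer_scores = cross_validate(inner, X, y, cv=5, n_jobs=-1)
print(outer_scores["test_score"])
```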
Upvotes: 2
Views: 309
Reputation: 1
A longer version needs a bit of understanding of how the n_jobs settings are actually handled.
There are only a few expensive resources to share: the CPU cores themselves, the fastest and most expensive core-local cache hierarchy (not going as deep as cache lines and their respective associativity at this level), and the cheaper but far slower RAM. The n_jobs = -1 directive, in the first call signature executed, simply grabs all of these resources at once.
That means there are no reasonably "free" resources left for any "deeper" level that attempts, again, to use "as many resources" as are physically available (which a nested n_jobs = -1 does, and obeys again). With nothing left unharnessed by the first level, the result is just havoc in the scheduler as it tries to map/evict/map/evict/map/evict ever more processing jobs onto the same real (and already pretty busy) hardware elements.
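A minimal, hedged sketch (again with purely illustrative data and estimator), assuming you keep the "grab all cores" directive at one level only, here the outer cross_validate, while the inner GridSearchCV stays serial:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_validate

# purely illustrative data and estimator
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# inner loop kept serial, so it does not fight the outer loop for the cores
inner = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 10]},
    cv=3,
    n_jobs=1,
)

# only the outer loop fans out across the physically available cores
outer_scores = cross_validate(inner, X, y, cv=5, n_jobs=-1)
```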
Often even the first level alone may create trouble on the RAM-allocation side: large models require as many replications of all the RAM data structures as the number of CPU cores "dictates", because a whole copy is effectively made during process instantiation, with all objects, used or not, replicated into each new process. The resulting memory swapping is definitely something you will never want to repeat.
Enjoy the model hyperparameter tuning - it is the crème de la crème of Machine Learning practice. Worth being good at.
Upvotes: 2