Reputation: 4603
I can't understand how n_jobs works:
import sklearn.cluster
import sklearn.datasets

data, labels = sklearn.datasets.make_blobs(n_samples=1000, n_features=416, centers=20)
k_means = sklearn.cluster.KMeans(n_clusters=10, max_iter=3, n_jobs=1).fit(data)
runs in less than 1 second.
With n_jobs=2, it takes nearly twice as long.
With n_jobs=8, it takes so long it never finished on my computer... (I have 8 cores).
Is there something I don't understand about how parallelization works?
Upvotes: 1
Views: 11484
Reputation: 336
You can use n_jobs=-1 to use all your CPUs, or n_jobs=-2 to use all of them except one.
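For example, a minimal sketch, assuming a scikit-learn version older than 0.23 where KMeans still accepted n_jobs (the toy data sizes below are made up for illustration):

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Toy data, values assumed for illustration only
data, _ = make_blobs(n_samples=1000, n_features=20, centers=10)

# n_jobs=-1 uses all CPUs; n_jobs=-2 would leave one CPU free
km = KMeans(n_clusters=10, n_jobs=-1).fit(data)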
Update:
The n_jobs parameter of KMeans is deprecated since version 0.23 and was removed in 1.0.
Apparently you can get the same functionality by setting the OMP_NUM_THREADS environment variable (in this case), like:
OMP_NUM_THREADS=4 python my_script.py
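If you prefer to control this from inside the script, a rough alternative sketch is to cap the OpenMP/BLAS thread pools with threadpoolctl, which the page linked below also covers (the toy data here is assumed for illustration):

from threadpoolctl import threadpool_limits
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data, values assumed for illustration only
data, _ = make_blobs(n_samples=1000, n_features=20, centers=10)

# Cap the OpenMP/BLAS thread pools at 4 threads for this block
with threadpool_limits(limits=4):
    KMeans(n_clusters=10).fit(data)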
Read more about parallelism here: https://scikit-learn.org/stable/computing/parallelism.html#lower-level-parallelism-with-openmp
Upvotes: 5
Reputation: 16876
n_jobs specifies the number of concurrent processes/threads that should be used for parallelized routines.
From the docs:
Some parallelism uses a multi-threading backend by default, some a multi-processing backend. It is possible to override the default backend by using sklearn.utils.parallel_backend.
Because of Python's GIL, more threads do not guarantee better speed. So check whether your backend is configured for threads or processes. If it is threads, try changing it to processes (but you will also pay the overhead of inter-process communication).
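As a rough sketch (assuming a scikit-learn version where KMeans still accepted n_jobs, and toy data made up for illustration), you could force a process-based backend with sklearn.utils.parallel_backend:

from sklearn.utils import parallel_backend
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data, values assumed for illustration only
data, _ = make_blobs(n_samples=1000, n_features=20, centers=10)

# "loky" is joblib's process-based backend; "threading" would use threads instead
with parallel_backend("loky", n_jobs=2):
    KMeans(n_clusters=10, n_jobs=2).fit(data)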
Again from the docs:
Whether parallel processing is helpful at improving runtime depends on many factors, and it’s usually a good idea to experiment rather than assuming that increasing the number of jobs is always a good thing. It can be highly detrimental to performance to run multiple copies of some estimators or functions in parallel.
So n_jobs is not a silver bullet; you have to experiment to see whether it helps for your estimator and your kind of data.
Upvotes: 5