Arcyno

Reputation: 4603

Sklearn kmeans with multiprocessing

I can't understand how n_jobs works:

import sklearn.cluster
import sklearn.datasets
data, labels = sklearn.datasets.make_blobs(n_samples=1000, n_features=416, centers=20)
k_means = sklearn.cluster.KMeans(n_clusters=10, max_iter=3, n_jobs=1).fit(data)

This runs in less than 1 second.

With n_jobs=2, it takes nearly twice as long.

With n_jobs=8, it takes so long that it never finished on my computer (I have 8 cores).

Is there something I don't understand about how parallelization works?

Upvotes: 1

Views: 11484

Answers (2)

mohammad RaoofNia

Reputation: 336

You can use n_jobs=-1 to use all your CPUs or n_jobs=-2 to use all of them except one.

Update: the n_jobs parameter of KMeans was deprecated in version 0.23 and removed in 1.0. Apparently you can get the same functionality by setting the OMP_NUM_THREADS environment variable (in this case), like:

OMP_NUM_THREADS=4 python my_script.py

Read more about parallelism here: https://scikit-learn.org/stable/computing/parallelism.html#lower-level-parallelism-with-openmp
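If changing the launch command is inconvenient, the same variable can probably be set from inside the script. This is only a sketch, assuming scikit-learn 0.23 or later and assuming the variable is set before scikit-learn (and NumPy) are imported, since the OpenMP runtime reads it when it starts up. The value 4 and the random_state are illustrative choices, not from the question.

```python
import os

# Assumption: this must run before scikit-learn/NumPy are imported,
# otherwise the OpenMP runtime may already have picked its thread count.
os.environ["OMP_NUM_THREADS"] = "4"

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Same shape of data as in the question; random_state added for repeatability.
data, labels = make_blobs(n_samples=1000, n_features=416, centers=20, random_state=0)

# No n_jobs argument: KMeans parallelism is controlled through OpenMP here.
k_means = KMeans(n_clusters=10, max_iter=3, n_init=10, random_state=0).fit(data)
print(k_means.cluster_centers_.shape)
```

Setting the variable on the command line, as shown above, avoids any dependence on import order.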

Upvotes: 5

mujjiga

Reputation: 16876

n_jobs specifies the number of concurrent processes/threads that should be used for parallelized routines.

From the docs:

Some parallelism uses a multi-threading backend by default, some a multi-processing backend. It is possible to override the default backend by using sklearn.utils.parallel_backend.

Because of Python's GIL, more threads do not guarantee better speed. Check whether your backend is configured for threads or processes. If it uses threads, try switching to processes (though you will then pay the overhead of IPC).
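To see the GIL effect in isolation, here is a small standard-library sketch. It does not involve scikit-learn at all: count_down, N and TASKS are made-up names for illustration, and the loop is deliberately pure Python so that it holds the GIL. On a standard CPython build, the threaded run is usually no faster than the serial one; compiled code that releases the GIL (as much of NumPy and scikit-learn does) can behave differently.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def count_down(n):
    """CPU-bound pure-Python loop; it holds the GIL while it runs."""
    while n > 0:
        n -= 1
    return n


N = 2_000_000   # iterations per task (illustrative value)
TASKS = 4       # number of tasks / worker threads (illustrative value)

# Run the tasks one after another.
start = time.perf_counter()
serial_results = [count_down(N) for _ in range(TASKS)]
serial_seconds = time.perf_counter() - start

# Run the same tasks in a thread pool.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=TASKS) as pool:
    threaded_results = list(pool.map(count_down, [N] * TASKS))
threaded_seconds = time.perf_counter() - start

print(f"serial:   {serial_seconds:.3f} s")
print(f"threaded: {threaded_seconds:.3f} s")
```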

Again from the docs:

Whether parallel processing is helpful at improving runtime depends on many factors, and it’s usually a good idea to experiment rather than assuming that increasing the number of jobs is always a good thing. It can be highly detrimental to performance to run multiple copies of some estimators or functions in parallel.

So n_jobs is not a silver bullet: you have to experiment to see whether it helps for your estimators and your kind of data.
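One way to run that experiment is to time the same fit under different thread limits. The sketch below assumes scikit-learn 0.23 or later (where KMeans no longer relies on n_jobs) and uses threadpoolctl, which is installed as a scikit-learn dependency. The time_fit helper, the thread counts and the random_state are illustrative choices, and the timings will vary from machine to machine.

```python
import time

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from threadpoolctl import threadpool_limits

# Same shape of data as in the question; random_state added for repeatability.
data, _ = make_blobs(n_samples=1000, n_features=416, centers=20, random_state=0)


def time_fit(n_threads):
    """Return wall-clock seconds for one KMeans fit limited to n_threads."""
    # Limits the native thread pools (OpenMP and BLAS) inside this block.
    with threadpool_limits(limits=n_threads):
        start = time.perf_counter()
        KMeans(n_clusters=10, max_iter=3, n_init=10, random_state=0).fit(data)
        return time.perf_counter() - start


timings = {n: time_fit(n) for n in (1, 2, 4)}
for n, seconds in timings.items():
    print(f"{n} thread(s): {seconds:.3f} s")
```

For a dataset this small, the per-thread overhead can easily outweigh the gain, so repeating each measurement a few times gives a more reliable picture.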

Upvotes: 5
