user3433489

Reputation: 989

Python Using Multiple Cores Without Me Asking

I am running a double nested loop over i,j and I use sklearn's PCA function inside the inner loop. Though I am not using any parallel processing packages, the task manager is telling me that all my CPU cores are running between 80%-100%. I am pleasantly surprised by this, and have two questions:

1) What is going on here? How did Python decide to use multiple CPUs? How is it breaking up the loop? Printing out the i,j values, they are still being completed in order.

2) Would the code be sped up even more by explicitly parallelizing it with a package, or would the difference be negligible?

Upvotes: 1

Views: 1057

Answers (1)

Charles Landau

Reputation: 4275

From the scikit-learn documentation:

"Several scikit-learn tools... rely internally on Python’s multiprocessing module to parallelize execution onto several Python processes by passing n_jobs > 1 as argument."

One explanation, therefore, is that somewhere in your code n_jobs is being passed to an sklearn estimator. I'm a bit confused, though, because among the PCA tools only the specialized variants accept that argument in the docs:

https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html (No n_jobs)

https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.KernelPCA.html (Has n_jobs)

https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.MiniBatchSparsePCA.html (Has n_jobs)

NumPy may also be the culprit: its linear-algebra routines (such as the SVD that PCA performs under the hood) are usually backed by a multithreaded BLAS/LAPACK implementation like OpenBLAS or MKL, which will spread the work across several cores even though your Python code is single-threaded. You would have to dig into the implementation a bit to see exactly where sklearn is making use of these NumPy routines.
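One way to test the BLAS hypothesis is to cap the native thread pools before NumPy loads and watch whether CPU usage drops to a single core. A minimal sketch (the SVD here is a stand-in for the one PCA computes internally; the specific environment variables cover the common OpenBLAS/MKL/OpenMP backends):

```python
import os

# These must be set BEFORE importing numpy -- the BLAS libraries
# read them once, at load time.
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 50))

# PCA boils down to an SVD of the centered data; this is the
# linear-algebra call a multithreaded BLAS would parallelize.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

components = Vt[:2]                  # first two principal axes
explained = (s ** 2) / (X.shape[0] - 1)
```

If the CPU usage falls to one core with these limits in place, the multithreaded BLAS was responsible, not sklearn's n_jobs machinery.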

Scikit-learn has a documentation page specifically about optimizing existing sklearn tools (and writing your own). It offers a variety of suggestions and specifically mentions joblib; check it out.

Upvotes: 1
