Reputation: 989
I am running a double nested loop over i,j and I use sklearn's PCA function inside the inner loop. Though I am not using any parallel processing packages, the task manager is telling me that all my CPUs are running between 80% and 100%. I am pleasantly surprised by this, and have 2 questions:
1) What is going on here? How did Python decide to use multiple CPUs? How is it breaking up the loop? Printing out the i,j values, they are still being completed in order.
2) Would the code be sped up even more by explicitly parallelizing it with a package, or would the difference be negligible?
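For context, the structure is roughly the following (a simplified sketch; the actual data, array names, and shapes are made up here):

    import numpy as np
    from sklearn.decomposition import PCA

    # Hypothetical stand-in for the real data: a grid of (samples, features) blocks.
    data = np.random.rand(10, 20, 1000, 50)

    for i in range(data.shape[0]):
        for j in range(data.shape[1]):
            # sklearn PCA inside the inner loop; no explicit parallelism anywhere.
            reduced = PCA(n_components=5).fit_transform(data[i, j])
            print(i, j)  # iterations still complete in order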
Upvotes: 1
Views: 1057
Reputation: 4275
Scikit-learn parallelizes many of its estimators internally (via joblib) when they accept an n_jobs argument. One explanation, therefore, is that somewhere in your code n_jobs is a valid argument for an sklearn process. I'm a bit confused, though, because only the specialized PCA tools have that argument in the docs:
https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html (no n_jobs)
https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.KernelPCA.html (has n_jobs)
https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.MiniBatchSparsePCA.html (has n_jobs)
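For instance, with KernelPCA the parallelism is explicit and opt-in. A quick sketch with made-up data (not from your code):

    import numpy as np
    from sklearn.decomposition import KernelPCA

    X = np.random.rand(500, 64)  # hypothetical data

    # n_jobs=-1 asks joblib to use all available cores for the
    # parallelized parts of KernelPCA.
    kpca = KernelPCA(n_components=10, kernel="rbf", n_jobs=-1)
    X_reduced = kpca.fit_transform(X)

If you never pass n_jobs anywhere, this kind of parallelism shouldn't be kicking in on its own.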
NumPy may also be the culprit: its linear algebra routines call into a BLAS/LAPACK library (typically MKL or OpenBLAS) that is multithreaded by default, so the SVD underneath PCA can spread across cores without any n_jobs at all. You would have to dig into the implementation a bit to examine exactly where sklearn is making use of NumPy this way.
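One way to test that hypothesis is to cap the BLAS thread pool and watch CPU usage. This sketch assumes the third-party threadpoolctl package is installed and uses made-up data:

    import numpy as np
    from sklearn.decomposition import PCA
    from threadpoolctl import threadpool_limits  # pip install threadpoolctl

    X = np.random.rand(2000, 300)  # hypothetical data

    # Cap the BLAS/OpenMP thread pools to a single thread. If CPU usage
    # drops to one core while this runs, implicit BLAS threading was
    # what kept all your cores busy.
    with threadpool_limits(limits=1):
        PCA(n_components=10).fit(X)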
Sklearn has a landing page specifically for optimizing existing sklearn tools (and writing your own tools). They offer a variety of suggestions and specifically mention joblib.
Check it out
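As for your second question: you can parallelize the loop explicitly with joblib, but whether it helps depends on how much each iteration already saturates the cores; oversubscribing them can even slow things down. A rough sketch (hypothetical data and shapes, not your code):

    import numpy as np
    from joblib import Parallel, delayed
    from sklearn.decomposition import PCA

    data = np.random.rand(10, 20, 1000, 50)  # hypothetical (i, j, samples, features)

    def run_pca(block):
        # One inner-loop iteration: fit PCA on a single (samples, features) block.
        return PCA(n_components=5).fit_transform(block)

    # Farm the (i, j) iterations out to worker processes.
    results = Parallel(n_jobs=-1)(
        delayed(run_pca)(data[i, j])
        for i in range(data.shape[0])
        for j in range(data.shape[1])
    )

The only way to know if the explicit version wins is to time both on your actual data.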
Upvotes: 1