Reputation: 8918
Here is the pstree output of my currently running GridSearch. I am curious to see what processes are going on, and there is something I cannot explain yet.
├─bash─┬─perl───20*[bash───python─┬─5*[python───31*[{python}]]]
│ │ └─11*[{python}]]
│ └─tee
└─bash───pstree
I removed unrelated output. Curly braces denote threads.
I use parallel -j 20 to start my python jobs. As you can see, 20* indeed shows there are 20 processes.
The bash process before each of the python processes is due to activation of the Anaconda virtual environment with source activate venv.
Inside each python process, another 5 python processes (5*) are spawned. This is because I specified n_jobs=5 to GridSearchCV.
My understanding ends here.
Question: can anyone explain why there are another 11 python threads (11*[{python}]) alongside the grid search, and 31 python threads (31*[{python}]) spawned inside each of the 5 grid search jobs?
Update: added the code for calling GridSearchCV
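import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Log-spaced grid of regularization strengths, starting at 0.01
Cs = 10 ** np.arange(-2, 2, 0.1)
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
clf = LogisticRegression()
gs = GridSearchCV(
    clf,
    param_grid={'C': Cs, 'penalty': ['l1'],
                'tol': [1e-10], 'solver': ['liblinear']},
    cv=skf,
    scoring='neg_log_loss',
    n_jobs=5,  # number of worker processes
    verbose=1,
    refit=True)
# Xs, ys: feature matrix and labels (not shown here)
gs.fit(Xs, ys)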
Update (2017-09-27):
I put together test code in a gist so that you can easily reproduce this if interested.
I tested the same code on a Mac Pro and on multiple linux machines, and reproduced @igrinis' result, but only on the Mac Pro. On the linux machines, I get numbers that differ from before but are consistent across machines. So the number of threads spawned may depend on the particular data fed to GridSearchCV.
python─┬─5*[python───31*[{python}]]
└─3*[{python}]
Note that the versions of pstree installed by homebrew/linuxbrew on the Mac Pro and on the linux machines are different. Here are the exact versions I used:
Mac:
pstree $Revision: 2.39 $ by Fred Hucht (C) 1993-2015
EMail: fred AT thp.uni-due.de
Linux:
pstree (PSmisc) 22.20
Copyright (C) 1993-2009 Werner Almesberger and Craig Small
The Mac version doesn't seem to have an option to show threads, which I thought could be why they are not seen in the result. I haven't found a way to inspect threads on Mac Pro easily yet. If you happen to know a way, please comment.
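One possible way to count threads without pstree is sketched below. It is a rough sketch, not a tested recipe: the helper name count_threads is hypothetical, the Linux branch relies on the /proc/<pid>/task directory, and the macOS branch assumes that ps -M prints a header followed by one row per thread, which I have not verified on the Mac Pro from the question.

```python
import os
import subprocess
import sys


def count_threads(pid):
    """Return the number of OS threads in process `pid`, or None if unknown."""
    task_dir = f"/proc/{pid}/task"
    if os.path.isdir(task_dir):
        # Linux: every thread of the process has one entry in /proc/<pid>/task.
        return len(os.listdir(task_dir))
    if sys.platform == "darwin":
        # Assumption: macOS `ps -M` prints a header plus one row per thread.
        result = subprocess.run(
            ["ps", "-M", "-p", str(pid)],
            capture_output=True, text=True, check=False,
        )
        rows = [line for line in result.stdout.splitlines() if line.strip()]
        return (len(rows) - 1) if len(rows) > 1 else None
    return None


if __name__ == "__main__":
    # Thread count of the current interpreter (includes the main thread).
    print(count_threads(os.getpid()))
```

Unlike the pstree output above, this count includes the main thread of the process, so it should be one higher than the number shown in curly braces.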
Update (2017-10-12):
In another set of experiments, I confirmed that setting the environment variable OMP_NUM_THREADS makes a difference.
Before export OMP_NUM_THREADS=1, many threads (63 in this case) of unclear purpose are spawned, as described above:
bash───python─┬─23*[python───63*[{python}]]
└─3*[{python}]
Linux parallel is not used here, and n_jobs=23.
After export OMP_NUM_THREADS=1, no threads are spawned inside the worker processes, but the 3 Python threads at the top level are still there, and I still do not know what they are for.
bash───python─┬─23*[python]
└─3*[{python}]
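The same cap can also be applied from inside the script instead of the shell. This is a minimal sketch: only OMP_NUM_THREADS comes from the experiment above; MKL_NUM_THREADS and OPENBLAS_NUM_THREADS are added on the assumption that the numerical backend might be MKL or OpenBLAS, and the helper name cap_native_threads is hypothetical.

```python
import os

# Only OMP_NUM_THREADS is confirmed by the experiment above. The other two
# names are assumptions about which BLAS backend numpy is linked against.
THREAD_ENV_VARS = ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS")


def cap_native_threads(n=1):
    """Set the thread-count variables; call before importing numpy/sklearn."""
    for name in THREAD_ENV_VARS:
        os.environ[name] = str(n)


cap_native_threads(1)
# Import numpy / sklearn only after this point, so the native libraries read
# the variables when they initialise their thread pools. Worker processes
# started later inherit this environment from the parent.
```

The variables have to be set before the libraries are first imported; changing them afterwards typically has no effect on thread pools that already exist.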
I initially came across OMP_NUM_THREADS because it causes errors in some of my GridSearchCV jobs. The error messages look something like this:
OMP: Error #34: System unable to allocate necessary resources for OMP thread:
OMP: System error #11: Resource temporarily unavailable
OMP: Hint: Try decreasing the value of OMP_NUM_THREADS.
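To put the numbers in perspective, here is a rough tally of processes plus threads for the two trees shown above. System error #11 is EAGAIN, so one plausible reading (not confirmed by anything in my logs) is that a per-user task limit was reached. The helper total_tasks is hypothetical, and the tally assumes that pstree lists the extra threads without counting each process's main thread.

```python
def total_tasks(n_parents, parent_threads, n_workers, worker_threads):
    """Count processes plus threads for the tree shapes shown by pstree.

    Assumes pstree's N*[{python}] excludes the main thread of each process,
    so every process contributes 1 + N schedulable tasks.
    """
    per_parent = (1 + parent_threads) + n_workers * (1 + worker_threads)
    return n_parents * per_parent


first_run = total_tasks(20, 11, 5, 31)  # parallel -j 20, n_jobs=5
second_run = total_tasks(1, 3, 23, 63)  # no parallel, n_jobs=23
print(first_run, second_run)  # 3440 1476

try:
    import resource  # Unix only

    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
    print("RLIMIT_NPROC (soft, hard):", soft, hard)
except (ImportError, AttributeError):
    pass  # limit not available on this platform
```

If the soft limit printed at the end is close to these totals, lowering OMP_NUM_THREADS or n_jobs would be the obvious thing to try.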
Upvotes: 11
Views: 3771
Reputation: 13676
From the sklearn.GridSearchCV doc:
n_jobs : int, default=1
    Number of jobs to run in parallel.
pre_dispatch : int, or string, optional
    Controls the number of jobs that get dispatched during parallel execution. Reducing this number can be useful to avoid an explosion of memory consumption when more jobs get dispatched than CPUs can process. This parameter can be:
    - None, in which case all the jobs are immediately created and spawned. Use this for lightweight and fast-running jobs, to avoid delays due to on-demand spawning of the jobs
    - An int, giving the exact number of total jobs that are spawned
    - A string, giving an expression as a function of n_jobs, as in ‘2*n_jobs’
If I understand the documentation properly, GridSearchCV spawns as many threads as there are grid points, and only runs n_jobs of them simultaneously. I believe the number 31 is some kind of cap on your 40 possible values. Try playing with the value of the pre_dispatch parameter.
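For reference, the quantities involved can be worked out by hand. This is only a sketch of the arithmetic, assuming the default pre_dispatch='2*n_jobs' and the 40 values of C from the question; the variable names are made up for illustration, and the eval call merely mimics how such an expression could be evaluated.

```python
n_candidates = 40          # values of C in the question's grid
n_splits = 10              # StratifiedKFold(n_splits=10)
n_jobs = 5
pre_dispatch = "2*n_jobs"  # GridSearchCV default

# One fit per (candidate, fold) pair; refit=True adds one final fit.
total_fits = n_candidates * n_splits

# Number of tasks dispatched ahead of time, as a function of n_jobs.
in_flight = eval(pre_dispatch, {"__builtins__": {}}, {"n_jobs": n_jobs})

print(total_fits, in_flight)  # 400 10
```

Changing pre_dispatch changes the second number only, which gives a simple check of whether the observed thread count follows it.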
I believe the other 11 threads have nothing to do with GridSearchCV itself, since they are shown at the same level as the worker processes. I think they are leftovers of other commands.
By the way, I don't observe this behavior on a Mac (I only see the 5 processes spawned by GridSearchCV, as one would expect), so it may come from incompatible libraries. Try updating sklearn and numpy manually.
Here is my pstree output (part of the path deleted for privacy):
└─┬= 00396 *** -fish
└─┬= 21743 *** python /Users/***/scratch_5.py
├─── 21775 *** python /Users/***/scratch_5.py
├─── 21776 *** python /Users/***/scratch_5.py
├─── 21777 *** python /Users/***/scratch_5.py
├─── 21778 *** python /Users/***/scratch_5.py
└─── 21779 *** python /Users/***/scratch_5.py
Answer to the second comment:
That's actually your code. I just generated a separable 1-D two-class problem:
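import numpy as np

N = 50000  # samples per class
# Class 0 lies in [0, 1) and class 1 in [3, 4), so the classes are separable
Xs = np.concatenate((np.random.random(N), 3 + np.random.random(N))).reshape(-1, 1)
ys = np.concatenate((np.zeros(N), np.ones(N)))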
100k samples were enough to keep the CPU busy for about a minute.
Upvotes: 4