Reputation: 871
I am running a parallelized grid search in Python using joblib.Parallel. My script is relatively straightforward:
# Imports for data and classes...

Parallel(n_jobs=n_jobs)(
    delayed(biz_model)(
        ...
    )
    for ml_model_params in grid
    for past_horizon in past_horizons
)
When I run it on my local machine it seems to run fine, though I can only test it on small datasets for memory reasons. Yet when I try to run it on a remote Oracle Linux server, it begins some runs and after a while outputs:
/u01/.../resources/python/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 1 leaked semaphores to clean up at shutdown
len(cache))
Aborted!
I tried to reproduce it locally, and with small experiments it does run. The unparallelized script also runs, and the number of jobs (whether low or high) doesn't prevent the bug from happening.
So my question is: given that there is no traceback, is there a way to make joblib or Parallel more verbose? I just cannot get an idea of where to look for possible failure causes without a traceback. Obviously, if a possible reason for the abort can be inferred from just this output (and I fail to grasp it), I would very much appreciate the pointer.
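For reference, joblib's Parallel does accept a verbose argument (higher values print more frequent progress messages), but as far as I can tell it only reports task dispatch and completion, not tracebacks:

Parallel(n_jobs=n_jobs, verbose=50)(
    delayed(biz_model)(
        ...
    )
    for ml_model_params in grid
    for past_horizon in past_horizons
)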
Thanks in advance.
Upvotes: 3
Views: 2289
Reputation: 1459
Using a logger, then catching the exception, logging it, flushing the logs and raising it again usually does the trick. Since a try block cannot appear inside a generator expression, the exception handling goes into a small wrapper around the worker function (the wrapper name below is arbitrary):
# Imports for data and classes...
# Creates logger...

def biz_model_logged(*args, **kwargs):
    # Wrapper that logs any exception raised inside a worker, then re-raises it
    try:
        return biz_model(*args, **kwargs)
    except BaseException as e:
        logger.exception(e)
        # you can use a for loop here if you have more than one handler
        logger.handlers[0].flush()
        raise

Parallel(n_jobs=n_jobs)(
    delayed(biz_model_logged)(
        ...
    )
    for ml_model_params in grid
    for past_horizon in past_horizons
)
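A minimal logger setup to go with this could look like the sketch below (the file name and format are arbitrary choices of mine). A FileHandler writes the records to disk, so they survive even if the process aborts right after the flush:

import logging

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)
# Attach the handler to this logger directly so logger.handlers[0] above finds it
handler = logging.FileHandler("grid_search.log")
handler.setFormatter(logging.Formatter("%(asctime)s %(processName)s %(levelname)s %(message)s"))
logger.addHandler(handler)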
Upvotes: 2