Reputation: 91
I am running Dask on a SLURM-managed cluster. My job script starts dask-ssh in the background, runs my Python code, and then waits for dask-ssh to finish:
dask-ssh --nprocs 2 --nthreads 1 --scheduler-port 8786 --log-directory `pwd` --hostfile hostfile.$JOBID &
sleep 10
# We need to tell dask Client (inside python) where the scheduler is running
scheduler="`hostname`:8786"
echo "Scheduler is running at ${scheduler}"
export ARL_DASK_SCHEDULER=${scheduler}
echo "About to execute $CMD"
eval $CMD
# Wait for dask-ssh to be shut down from the Python code
wait %1
I create a Client inside my Python code and, when finished, I shut it down:
from dask.distributed import Client

c = Client(scheduler_id)
...
c.shutdown()
My reading of the dask-ssh help is that the shutdown should stop all workers and then the scheduler, but it does not stop the background dask-ssh process, so eventually the job times out.
I've tried this interactively in the shell. I cannot see how to stop the scheduler.
I would appreciate any help.
Thanks, Tim
Upvotes: 3
Views: 1517
Reputation: 57291
First, when setting up with SLURM you might consider using the --scheduler-file option, which allows you to coordinate the scheduler address using your NFS (which I assume you have, given that you're using SLURM). I recommend reading this doc section: http://distributed.readthedocs.io/en/latest/setup.html#using-a-shared-network-file-system-and-a-job-scheduler
# On one node
dask-scheduler --scheduler-file /path/to/scheduler.json
# On the other nodes
dask-worker --scheduler-file /path/to/scheduler.json
dask-worker --scheduler-file /path/to/scheduler.json
>>> client = Client(scheduler_file='/path/to/scheduler.json')
Given this, it also becomes easier to use the sbatch or qsub commands directly. Here is an example with SGE's qsub; a rough SLURM translation follows below:
# Start a dask-scheduler somewhere and write connection information to file
qsub -b y /path/to/dask-scheduler --scheduler-file /path/to/scheduler.json
# Start 100 dask-worker processes in an array job pointing to the same file
qsub -b y -t 1-100 /path/to/dask-worker --scheduler-file /path/to/scheduler.json
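The SLURM equivalent would use sbatch. A minimal sketch, assuming sbatch --wrap is available at your site and reusing the same placeholder paths:

# Start a dask-scheduler somewhere and write connection information to file
sbatch --wrap "/path/to/dask-scheduler --scheduler-file /path/to/scheduler.json"

# Start 100 dask-worker processes in an array job pointing to the same file
sbatch --array=1-100 --wrap "/path/to/dask-worker --scheduler-file /path/to/scheduler.json"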
It looks like client.shutdown only shuts down the client. You're correct that this is inconsistent with the docstring. I've raised an issue here: https://github.com/dask/distributed/issues/1085 for tracking further developments.
These three commands should suffice to tear down the workers, close the scheduler, and stop the scheduler process:
# Retire and close all workers
client.loop.add_callback(client.scheduler.retire_workers, close_workers=True)
# Tell the scheduler to terminate
client.loop.add_callback(client.scheduler.terminate)
# Stop the scheduler's event loop so its process can exit
client.run_on_scheduler(lambda dask_scheduler: dask_scheduler.loop.stop())
Typically people start and stop clusters with whatever means they used to start them. We should make the client-focused way more consistent regardless, but in your case this might mean using SLURM's scancel command.
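For example, a minimal sketch, assuming dask-ssh, the workers, and the scheduler all run inside the single SLURM allocation created by your job script, so cancelling that job tears everything down once the Python step has returned:

# After the Python program returns, cancel this job to stop dask-ssh and everything it started
scancel $SLURM_JOB_ID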
Upvotes: 4