Tim Cornwell

Reputation: 91

Stopping dask-ssh created scheduler from the Client interface

I am running Dask on a SLURM-managed cluster.

dask-ssh --nprocs 2 --nthreads 1 --scheduler-port 8786 --log-directory `pwd` --hostfile hostfile.$JOBID &
sleep 10

# We need to tell dask Client (inside python) where the scheduler is running
scheduler="`hostname`:8786"
echo "Scheduler is running at ${scheduler}"
export ARL_DASK_SCHEDULER=${scheduler}

echo "About to execute $CMD"

eval $CMD

# Wait for dask-ssh to be shut down from the Python code
wait %1

I create a Client inside my python code and then when finished, I shut it down.

c=Client(scheduler_id)
...
c.shutdown()

My reading of the dask-ssh help is that the shutdown will shut down all workers and then the scheduler. But it does not stop the background dask-ssh process, so the job eventually times out.

I've tried this interactively in the shell as well, and I cannot see how to stop the scheduler.

I would appreciate any help.

Thanks, Tim

Upvotes: 3

Views: 1517

Answers (1)

MRocklin

Reputation: 57291

Recommendation with --scheduler-file

First, when setting up with SLURM you might consider using the --scheduler-file option, which lets you coordinate the scheduler address through a shared network file system (which I assume you have, given that you're using SLURM). I recommend reading this doc section: http://distributed.readthedocs.io/en/latest/setup.html#using-a-shared-network-file-system-and-a-job-scheduler

dask-scheduler --scheduler-file /path/to/scheduler.json
dask-worker --scheduler-file /path/to/scheduler.json
dask-worker --scheduler-file /path/to/scheduler.json

>>> client = Client(scheduler_file='/path/to/scheduler.json')

This also makes it easier to use the sbatch or qsub command directly. Here is an example with SGE's qsub:

# Start a dask-scheduler somewhere and write connection information to file
qsub -b y /path/to/dask-scheduler --scheduler-file /path/to/scheduler.json

# Start 100 dask-worker processes in an array job pointing to the same file
qsub -b y -t 1-100 /path/to/dask-worker --scheduler-file /path/to/scheduler.json
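
Since your cluster runs SLURM, the analogous submissions might look like the following. This is a sketch, not a tested recipe: `--wrap` and `--array` are standard sbatch flags, but the paths and the array size are placeholders you would adjust for your site.

```
# Start a dask-scheduler as one SLURM job; it writes connection info to the file
sbatch --wrap "dask-scheduler --scheduler-file /path/to/scheduler.json"

# Start dask-worker processes as an array job pointing at the same file
sbatch --array=1-100 --wrap "dask-worker --scheduler-file /path/to/scheduler.json"
```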

Client.shutdown

It looks like client.shutdown only shuts down the client. You're correct that this is inconsistent with the docstring. I've raised an issue to track this: https://github.com/dask/distributed/issues/1085

In the meantime

These three commands should suffice to tear down the workers, close the scheduler, and stop the scheduler process:

client.loop.add_callback(client.scheduler.retire_workers, close_workers=True)
client.loop.add_callback(client.scheduler.terminate)
client.run_on_scheduler(lambda dask_scheduler: dask_scheduler.loop.stop())
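
If you do this often, the three calls can be wrapped in a small helper. This is only a sketch; `shutdown_cluster` is not part of the distributed API, just a name I've made up here, and it assumes `client` is a connected dask.distributed Client.

```python
def shutdown_cluster(client):
    # Ask the scheduler to retire (and close) all of its workers
    client.loop.add_callback(client.scheduler.retire_workers, close_workers=True)
    # Ask the scheduler to terminate itself
    client.loop.add_callback(client.scheduler.terminate)
    # Stop the scheduler's event loop so its process can exit
    client.run_on_scheduler(lambda dask_scheduler: dask_scheduler.loop.stop())
```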

What people usually do

Typically people start and stop clusters with whatever means they used to start them. Under SLURM this might involve the scancel command. Regardless, we should make the client-focused way more consistent.

Upvotes: 4
