Vinh Tran

Reputation: 189

DASK SSH Cluster in Jupyter Notebook

UPDATE I have copied the SSH keys over to all my machines and they can now communicate without a password; however, I still need to specify username@hostname instead of just the hostname. I have tried several different methods with no luck. Method 1: I ran the following in my Jupyter notebook:

from dask.distributed import Client, SSHCluster
cluster = SSHCluster(
    ["localhost", "username@hostname"],
    connect_options={"known_hosts": None},
    worker_options={"nthreads": 2},
)
client = Client(cluster)

I understand that connect_options is what gets passed to the asyncssh library to make the SSH connection, so I thought known_hosts was OK, since it looks similar to the authorized keys in my .ssh directory. However, I keep getting the following error:

 ~/anaconda3/lib/python3.7/concurrent/futures/thread.py in run(self)
 55 
 56         try:
 57             result = self.fn(*self.args, **self.kwargs)
 58         except BaseException as exc:
 59             self.future.set_exception(exc)

~/anaconda3/lib/python3.7/socket.py in getaddrinfo(host, port, family, type, proto, flags)
    750     # and socket type values to enum constants.
    751     addrlist = []
--> 752     for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
    753         af, socktype, proto, canonname, sa = res
    754         addrlist.append((_intenum_converter(af, AddressFamily),

gaierror: [Errno -2] Name or service not known
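For reference, connect_options is forwarded to asyncssh when the connection is made, so a minimal standalone connection check looks something like the sketch below; "hostname" is a placeholder and key-based auth is assumed to be set up already:

import asyncio
import asyncssh  # the library Dask's SSHCluster uses for SSH connections

async def check():
    # known_hosts=None disables host-key checking; it is not related to the
    # authorized_keys file. "hostname" is a placeholder.
    async with asyncssh.connect("hostname", known_hosts=None) as conn:
        result = await conn.run("echo connected")
        print(result.stdout)

asyncio.run(check())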

The second method I tried was dask-ssh, for which I typed the following on the command line:

dask-ssh localhost username@hostname username@hostnameb --nprocs 10

However, when I open the dashboard I don't see any workers from the remote machines, only the 10 workers from localhost.

Please help: I have read tutorials, searched Stack Overflow, and even tried Kubernetes (microk8s, k3s, minikube, kubeadm) and Apache Hadoop/YARN, with many, many hours of failed results, so Dask's SSH cluster seems to be my only hope. I also like Dask because the dashboard looks better than Hadoop's (that yellow elephant kinda bugs me).

PREVIOUS I'm trying to create a Dask cluster between my machines at home using a Jupyter notebook. I understand the concepts behind schedulers, workers and clients. The Dask docs provide the following example, which I'm having a hard time getting to work:

from dask.distributed import Client, SSHCluster

cluster = SSHCluster(
    ["localhost", "localhost", "localhost", "localhost"],
    connect_options={"known_hosts": None},
    worker_options={"nthreads": 2},
    scheduler_options={"port": 0, "dashboard_address": ":8797"},
)
client = Client(cluster)

My question is: how do I configure SSHCluster so I can create a cluster between different machines? How do I set the IP address, username, and password? I understand there are better options out there like Hadoop/YARN and Kubernetes, but I wanted to understand the SSH cluster concept through a Jupyter notebook.

Thanks,

Upvotes: 1

Views: 2133

Answers (1)

mdurant

Reputation: 28673

The documentation tells you what to do.

How do I set the IP address, username, and password?

  • replace the list of "localhost"s with the names or IP addresses of the machines you want to connect to (see the sketch after this list). You must be able to log into each over SSH without being prompted for a username/password, and they must all have identical Python environments set up
  • do not try to use username/password; set up key-based auth instead. There are many ways to do this, so pick a simple one
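As a minimal sketch, assuming placeholder hostnames; the first entry is where the scheduler runs and the remaining entries become workers:

from dask.distributed import Client, SSHCluster

# "scheduler-host", "worker-a" and "worker-b" are placeholders for your
# machines' hostnames or IP addresses; all of them must accept key-based
# SSH logins and share the same Python environment.
cluster = SSHCluster(
    ["scheduler-host", "worker-a", "worker-b"],
    connect_options={"known_hosts": None},
    worker_options={"nthreads": 2},
)
client = Client(cluster)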

SSH cluster concept through Jupyter Notebook

Use of a notebook is immaterial here, you are executing python just the same.

there are better options out there like Hadoop/Yarn, Kubernetes

Many, many people use SSH because it is very simple, but it does leave you to manage any orchestration yourself (e.g., making sure machines are on the same network and can communicate, and managing environments).

-EDIT-

(to updated question)

Reading the asyncssh documentation, you want to pass an option called username= in connect_options (see here). asyncssh does not currently support using a ~/.ssh config file to define targets, unfortunately, so if you have different options for each server, you are out of luck.
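A minimal sketch of that, with "hostname" and "username" as placeholders for the remote machine and its login user:

from dask.distributed import Client, SSHCluster

# The login name goes in connect_options (forwarded to asyncssh) rather than
# being embedded in the host string, which is what triggered the
# "Name or service not known" resolution error above.
cluster = SSHCluster(
    ["localhost", "hostname"],
    connect_options={"known_hosts": None, "username": "username"},
    worker_options={"nthreads": 2},
)
client = Client(cluster)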

Note that if you are doing something very custom, you do not need to use dask-ssh at all; you can log in and run Dask explicitly on each server (see the sketch below).
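For example, a minimal sketch of that manual route, assuming you start the standard dask-scheduler and dask-worker commands on the machines yourself ("scheduler-host" is a placeholder):

from dask.distributed import Client

# After starting `dask-scheduler` on one machine and
# `dask-worker tcp://scheduler-host:8786` on each worker (e.g. over a normal
# SSH session), the notebook only needs to connect to the scheduler;
# 8786 is the scheduler's default port.
client = Client("tcp://scheduler-host:8786")
print(client)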

Upvotes: 2
