Snowcrash

Reputation: 86387

Joining a Docker swarm

I have 2 VMs.

On the first I run:

docker swarm join-token manager

On the second I run the command that this outputs.

i.e.

docker swarm join --token SWMTKN-1-0wyjx6pp0go18oz9c62cda7d3v5fvrwwb444o33x56kxhzjda8-9uxcepj9pbhggtecds324a06u 192.168.65.3:2377

However, this outputs:

Error response from daemon: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 192.168.65.3:2377: connect: connection refused"

Any idea what's going wrong?

If it helps, I'm spinning up these VMs using Vagrant.

Upvotes: 3

Views: 10859

Answers (6)

Nirajan Bhattarai

Reputation: 1

vagrant@manager:~$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 08:00:27:06:38:eb brd ff:ff:ff:ff:ff:ff
    inet 10.0.2.15/24 brd 10.0.2.255 scope global dynamic eth0
       valid_lft 84465sec preferred_lft 84465sec
    inet6 fe80::a00:27ff:fe06:38eb/64 scope link
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 08:00:27:39:27:f1 brd ff:ff:ff:ff:ff:ff
    inet 192.168.56.7/24 brd 192.168.56.255 scope global dynamic eth1
       valid_lft 465sec preferred_lft 465sec
    inet6 fe80::a00:27ff:fe39:27f1/64 scope link
       valid_lft forever preferred_lft forever

This generated:

docker swarm join --token SWMTKN-1-35q9xg25chynpxgljdyzd65yjggbe38s5j3kydto6vp6m341fk-cisjtrbdk3mtwgcgxrdywdeng 10.0.2.15:2377 --advertise-addr 10.0.2.15

error:
Error response from daemon: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 10.0.2.15:2377: connect: connection refused"

So I changed 10.0.2.15:2377 (the eth0 address) to the eth1 address, 192.168.56.7, and tried to connect with:

docker swarm join --token SWMTKN-1-35q9xg25chynpxgljdyzd65yjggbe38s5j3kydto6vp6m341fk-cisjtrbdk3mtwgcgxrdywdeng 192.168.56.7:2377 --advertise-addr 10.0.2.15

and the connection succeeded.

It seems the manager node's eth0 IP address doesn't work for swarm node connectivity (with Vagrant's default VirtualBox networking, eth0 is a NAT interface, so it typically isn't reachable from the other VM).
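
For reference, a minimal sketch of the more general fix: initialize the swarm advertising the manager's eth1 address, so the generated join command already points at a reachable IP. The 192.168.56.x address is the one from the ip a output above; the token is a placeholder.

# on the manager VM: advertise the host-only (eth1) address
docker swarm init --advertise-addr 192.168.56.7

# on the worker VM: join via the manager's eth1 address
docker swarm join --token <token> 192.168.56.7:2377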

Upvotes: -1

Baodi Di

Reputation: 600

I was facing a similar issue and spent a couple of hours figuring out the root cause; I'm sharing it here for anyone who runs into the same thing.

Environment:

  • Oracle Cloud + AWS EC2 (2 + 2 instances)
  • OS: Ubuntu 20.04.2
  • Docker version: 20.10.8
  • 3 dynamic public IPs + 1 elastic IP

Issues

  1. Created two instances on Oracle Cloud at the beginning.
  2. On instance A (manager), docker swarm init --advertise-addr succeeded.
  3. On instance B (worker), docker swarm join as a worker succeeded.
  4. When I tried to promote B to manager, I got the error: Unable to connect to remote host: No route to host (rough command sketch below).
  5. Mesh routing was not working properly.
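
For context, a rough sketch of the commands behind steps 2-4; the IP, token, and node ID are placeholders:

# on instance A (manager)
docker swarm init --advertise-addr <manager-ip>

# on instance B: join as a worker
docker swarm join --token <worker-token> <manager-ip>:2377

# back on A: promote B to manager (the step that failed here)
docker node promote <node-id-of-B>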

Investigation

  1. Suspected it was related to the network/firewall/security group/security list.
  2. ssh'd to server B (worker) and ran telnet against the manager on port 2377: same error, Unable to connect to remote host: No route to host.
  3. Logged in to the Oracle console and added ingress rules under the security list for all the relevant ports: TCP port 2377 for cluster management communications, TCP and UDP port 7946 for communication among nodes, UDP port 4789 for overlay network traffic.
  4. Tried again, but telnet still failed with the same error.
  5. Checked the OS-level firewall and disabled it if enabled (ufw disable).
  6. Tried again, still the same result.
  7. Suspected something was wrong with Oracle Cloud, so decided to install the same versions of OS/Docker on AWS.
  8. Added a security group allowing all the relevant ports/protocols and disabled ufw.
  9. Tested with AWS instances C (leader/manager) + D (worker): it worked, D could be promoted to manager, and mesh routing also worked.
  10. This confirmed the issue was with the Oracle Cloud instances.
  11. Tried to join Oracle instance A to C as a worker: it joined, but still could not be promoted to manager.
  12. Used journalctl -f to investigate the logs and confirmed socket timeouts from A/B (the Oracle instances) to the AWS instance (C).
  13. Looked at A/B again and found iptables rules blocking the requests.
  14. Removed all the rules set up in iptables:

# remove the rules
iptables -P INPUT ACCEPT
iptables -P OUTPUT ACCEPT
iptables -P FORWARD ACCEPT
iptables -F

Root Cause

It was caused by a firewall, either at the cloud level (security list/WAF/ACL/security group) or at the OS level (ufw/iptables rules).
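
Rather than disabling the firewall entirely, the ports listed in the investigation above can be opened explicitly. A minimal sketch, assuming ufw is the active firewall:

# allow the swarm ports on every node
ufw allow 2377/tcp    # cluster management communications
ufw allow 7946/tcp    # communication among nodes
ufw allow 7946/udp
ufw allow 4789/udp    # overlay network traffic
ufw reload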

Upvotes: 3

Joseph Tankoua

Reputation: 590

In my case I had a Linux node and a Windows node. The Docker Desktop private subnet on Windows was the same as my local network's subnet, so the Docker daemon was looking for the manager, at the address I gave it, inside its own internal network. So I did:

1- go to the Docker Desktop app
2- go to Settings
3- go to Resources
4- go to the Network section and change the Docker subnet address (it needs to be different from your local subnet address).
5- apply and restart.
6- run docker swarm join on the worker again.

Note: all these steps are performed on the node where the error appears. Make sure that ports 2377, 7946 and 4789 are open on the master (you can use iptables or ufw).

Hope it works for you.

Upvotes: 0

user51

Reputation: 10243

I had already run firewall-cmd --add-port=2377/tcp --permanent and firewall-cmd --reload on the master side and was still getting the same error. I ran telnet <master ip> 2377 on the worker node and then rebooted the master. After that it worked fine.

Upvotes: 3

Manish R

Reputation: 2392

It looks like your swarm manager leader is not listening on port 2377. You can check by running this command on your swarm manager leader VM; if it is working fine, you will get output similar to this:

[root@host1]# docker node ls
ID                            HOSTNAME                     STATUS              AVAILABILITY        MANAGER STATUS
tilzootjbg7n92n4mnof0orf0 *   host1    Ready               Active              Leader

Furthermore, you can check the listening ports on the leader swarm manager node. It should have TCP port 2377 open for cluster management communications and TCP/UDP port 7946 open for communication among nodes.

[root@host1]# netstat -ntulp | grep dockerd
tcp6       0      0 :::2377                 :::*                    LISTEN      2286/dockerd
tcp6       0      0 :::7946                 :::*                    LISTEN      2286/dockerd
udp6       0      0 :::7946                 :::*                                2286/dockerd

On the second VM, where you are configuring the second swarm manager, you will have to make sure you have connectivity to port 2377 of the leader swarm manager. You can use tools like telnet, wget, or nc to test the connectivity, as shown below:

[root@host2]# telnet <swarm manager leader ip> 2377
Trying 192.168.44.200...
Connected to 192.168.44.200.
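
If telnet is not installed, nc (mentioned above) gives an equivalent check; -z only scans for a listening service and -v prints the result. The IP is the same placeholder as above:

[root@host2]# nc -zv <swarm manager leader ip> 2377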

Upvotes: 1

Bharat vyas

Reputation: 71

Just add the port to the firewall on the master side:

firewall-cmd --add-port=2377/tcp --permanent
firewall-cmd --reload

Then try docker swarm join again on the second VM (node) side.
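
If the join still fails after opening 2377, the remaining swarm ports mentioned in the other answers may also need to be opened. A minimal sketch, assuming firewalld:

firewall-cmd --add-port=2377/tcp --permanent   # cluster management
firewall-cmd --add-port=7946/tcp --permanent   # communication among nodes
firewall-cmd --add-port=7946/udp --permanent
firewall-cmd --add-port=4789/udp --permanent   # overlay network traffic
firewall-cmd --reload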

Upvotes: 7
