Reputation: 5745
I believe the culprit is the master who does not appear to be listening on port 7946. netstat
shows that 7946 is listening on the nodes, but not the master. When I check the syslogs for the nodes I see the following error
level=error msg="Failed to join memberlist [10.0.0.12] on retry: 1 error(s) occurred:\n\n* Failed to join 10.0.0.12: dial tcp 10.0.0.12:7946: getsockopt: connection refused"
I am running a three node Swarm Mode cluster in AWS; one master and two workers. This is swarm mode not to be confused with docker swarm from pre 1.12.
I created all of the services with docker-machine. Each machine is running Ubuntu 15.10 with Docker 1.12.3.
Linux swarm-master-01 4.2.0-42-generic #49-Ubuntu SMP Tue Jun 28 21:26:26 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
Using the master node I have created a service with the following
docker service create --replicas 1 --name myapp -p 3000 myapp
When I run docker service ps myapp
I get the following output
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR
02awst8p9pezgpkfzqgz8z79t myapp.1 myapp:latest swarm-node-01 Running Running 19 minutes ago
The running task is deployed to swarm-node-01.
I checked the auto-selected port which was published publicly
$ docker service inspect myapp | jq .[].Endpoint.Ports[].PublishedPort
30000
According to the documentation:
External components, such as cloud load balancers, can access the service on the PublishedPort of any node in the cluster whether or not the node is currently running the task for the service. All nodes in the swarm route ingress connections to a running task instance.
But when I try to curl the nodes who do not have the task running I'm getting connection refused
.
$ curl $(docker-machine ip swarm-node-01):30000/stats
{"uptime":"2016-11-09T14:48:35Z","requestCount":7,"statuses":{"200":7},"pid":1,"open_db_conns":0}
$ curl $(docker-machine ip swarm-node-02):30000/stats
curl: (7) Failed to connect to [the IP] port 30000: Connection refused
note: I scrubbed the IP of node-02
My Troubleshooting:
UPDATE 1
I initialized the swarm with
docker swarm init --advertise-addr 10.0.0.12:2377 --listen-addr 10.0.0.12:2377
I checked the syslogs from the nodes and I'm seeing the following errors
level=error msg="Failed to join memberlist [10.0.0.12] on retry: 1 error(s) occurred:\n\n* Failed to join 10.0.0.12: dial tcp 10.0.0.12:7946: getsockopt: connection refused"
I checked to see if the ingress port was listening and it doesn't seem to be
ubuntu@swarm-master-01:~$ sudo lsof -i :7946
ubuntu@swarm-master-01:~$ cat < /dev/tcp/10.0.0.12/7946
-bash: connect: Connection refused
-bash: /dev/tcp/10.0.0.12/7946: Connection refused
ubuntu@swarm-master-01:~$ cat < /dev/tcp/0.0.0.0/7946
-bash: connect: Connection refused
-bash: /dev/tcp/0.0.0.0/7946: Connection refused
Upvotes: 2
Views: 3814
Reputation: 5745
I was able to get around the issue for now, but I don't know what initially caused it. The overlay network (port 7946) wasn't listening on swarm-master-01. I figured this out with netstat -nlt. I searched the syslogs and found these errors related to the port in the syslog.
Nov 8 20:28:20 ubuntu docker[23092]: time="2016-11-08T20:28:20.171385360Z" level=warning msg="2016/11/08 20:28:20 [ERR] memberlist: Failed TCP fallback ping: read tcp 10.0.0.85:54016->10.0.0.13:7946: i/o timeout"
Nov 9 18:26:17 swarm-node-01 docker[714]: time="2016-11-09T18:26:17.573441271Z" level=warning msg="2016/11/09 18:26:17 [ERR] memberlist: Failed to send indirect ping: write udp [::]:7946->10.0.0.38:7946: use of closed network connection"
For some reason docker refused to open this port and listen any more. Here is what I did (albeit undesirable) to circumvent the issue:
Now all of the machines are working as expected except for swarm-master-01. One task is running on swarm-node-01 and curl works against all nodes by forwarding the traffic to the proper container on the proper node. However, swarm-master-01 refuses to listen on the overlay network and curl does not work against this node. I was only able to fix swarm-master-01 by completely removing it from the cluster, restarting the docker daemon, and joining it again as a master. Now 7946 is listening on that machine.
Upvotes: 3