Andre Baresel

Reputation: 21

docker overlay network problems connecting containers

We are running an environment of 6 engines, each with 30 containers. Two of the engines run nginx proxy containers, and these two containers are the only way into the network.

It is now the second time that we are facing a major problem with a set of containers in this environment:

Both nginx containers cannot reach some of the containers on other machines. Only one physical engine has this problem; all others are fine. It started with timeouts for some of the containers, and now, after 24 hours, all containers on that machine have the problem.

Some more details:

One nginx is running on machine prod-3, the second nginx on machine prod-6. The containers with problems are running on prod-7. Neither nginx can reach those containers, but the containers can reach the nginx containers via "ping".

At the beginning, and again this morning, we could reach some of the containers, others not. It started with timeouts; now we cannot ping the containers in the overlay network at all. This time we were able to look at the traffic using tcpdump:

On the nginx container (10.10.0.37 on prod-3) we start a ping and, as you can see, 100% packet loss:

root@e89c16296e76:/# ping ew-engine-evwx-intro
PING ew-engine-evwx-intro (10.10.0.177) 56(84) bytes of data.

--- ew-engine-evwx-intro ping statistics ---
8 packets transmitted, 0 received, 100% packet loss, time 7056ms

root@e89c16296e76:/# 

On the target machine prod-7 (not inside the container) we see that all ping packets are received, so the overlay network is routing correctly to prod-7:

wurzel@rv_____:~/eventworx-admin$ sudo tcpdump -i ens3 dst port 4789 |grep 10.10.0.177
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on ens3, link-type EN10MB (Ethernet), capture size 262144 bytes
IP 10.10.0.37.35270 > 10.10.0.177.http: Flags [S], seq 2637350294, win 28200, options [mss 1410,sackOK,TS val 1897214191 ecr 0,nop,wscale 7], length 0
IP 10.10.0.37.35270 > 10.10.0.177.http: Flags [S], seq 2637350294, win 28200, options [mss 1410,sackOK,TS val 1897214441 ecr 0,nop,wscale 7], length 0
IP 10.10.0.37.35326 > 10.10.0.177.http: Flags [S], seq 2595436822, win 28200, options [mss 1410,sackOK,TS val 1897214453 ecr 0,nop,wscale 7], length 0
IP 10.10.0.37 > 10.10.0.177: ICMP echo request, id 83, seq 1, length 64
IP 10.10.0.37.35326 > 10.10.0.177.http: Flags [S], seq 2595436822, win 28200, options [mss 1410,sackOK,TS val 1897214703 ecr 0,nop,wscale 7], length 0
IP 10.10.0.37 > 10.10.0.177: ICMP echo request, id 83, seq 2, length 64
IP 10.10.0.37 > 10.10.0.177: ICMP echo request, id 83, seq 3, length 64
IP 10.10.0.37 > 10.10.0.177: ICMP echo request, id 83, seq 4, length 64
IP 10.10.0.37 > 10.10.0.177: ICMP echo request, id 83, seq 5, length 64
IP 10.10.0.37 > 10.10.0.177: ICMP echo request, id 83, seq 6, length 64
IP 10.10.0.37 > 10.10.0.177: ICMP echo request, id 83, seq 7, length 64
IP 10.10.0.37 > 10.10.0.177: ICMP echo request, id 83, seq 8, length 64
^C304 packets captured
309 packets received by filter
0 packets dropped by kernel

wurzel@_______:~/eventworx-admin$ 

First of all, you can see that there is no ICMP answer (so the firewall is not responsible, and neither is AppArmor).

Inside the target container (evwx-intro = 10.10.0.177) nothing is received; the interface eth0 (on the 10.10.0.0 network) is just silent:

root@ew-engine-evwx-intro:/home/XXXXX# tcpdump -i eth0
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
^C
0 packets captured
0 packets received by filter
0 packets dropped by kernel
root@ew-engine-evwx-intro:/home/XXXXX# 

It's really strange.

Is there any other tool from Docker that could help us see what is going on?
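For anyone else debugging something like this: the overlay traffic can also be captured one hop further, inside the hidden network namespace that Docker creates for the overlay network. This is only a rough sketch, not something we ran at the time; the network name my-overlay and the namespace name 1-abcdef123456 are placeholders, not values from our setup.

docker network inspect my-overlay        # endpoints and peers this engine knows about
sudo ls /var/run/docker/netns            # the overlay namespace is usually named 1-<short network id>
sudo nsenter --net=/var/run/docker/netns/1-abcdef123456 tcpdump -ni any icmp
                                         # does the decapsulated ping show up inside the overlay namespace?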

We did not change anything on the firewall, and there were no automatic updates of the system (maybe security updates).

The only activity was that some old containers were reactivated after a long period of inactivity (maybe 1-2 months).

We are really lost. If you have experienced something comparable, it would be very helpful to understand the steps you took.

Many thanks for any help with this.

=============================================================

6 hours later

After trying nearly everything for a full day, we made a final attempt: (1) stop all the containers, (2) stop the docker service, (3) stop the docker socket service, (4) restart the machine, (5) start the containers.

... and now it looks good, at least for the moment. To conclude: (1) we have no clue what was causing the problem, which is bad; (2) we have learned that the overlay network itself is not the problem, because the traffic reaches the target machine where the container lives; (3) we are able to trace the network traffic until it reaches the target machine, but somehow it does not "enter" the container, because inside the container the network interface shows no activity at all.
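For reference, the restart sequence above in shell form (a sketch only; we run Ubuntu 18.04 with systemd, and the last command simply starts everything that is currently stopped):

docker stop $(docker ps -q)          # (1) stop all running containers
sudo systemctl stop docker.service   # (2) stop the docker service
sudo systemctl stop docker.socket    # (3) stop the docker socket unit
sudo reboot                          # (4) restart the machine
docker start $(docker ps -aq)        # (5) after the reboot, start the containers again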

We have no knowledge about the VXLAN virtual network which is used by Docker, so if anybody has a hint, could you help us with a link or a tool for it?
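(For others reading along: the VXLAN plumbing can at least be inspected from the same hidden namespace mentioned above. Again only a sketch; the namespace name is a placeholder and the device names may differ on other setups.)

sudo nsenter --net=/var/run/docker/netns/1-abcdef123456 ip -d link show   # lists the bridge, the vethXXXX devices and a vxlan device (VNI, dstport 4789)
sudo nsenter --net=/var/run/docker/netns/1-abcdef123456 bridge fdb show   # MAC-to-VTEP forwarding entries that Docker programs from the key-value store
sudo nsenter --net=/var/run/docker/netns/1-abcdef123456 ip neigh show     # static ARP entries for the peer containers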

Many many thanks in advance. Andre

======================================================

4 days later...

Just had the same situation again after updating docker-ce 18.06 to 18.09.

We have two machines using docker-ce 18 in combination with Ubuntu 18.04, and I had just updated docker-ce to 18.09 because of another problem (Docker containers could not resolve DNS in Ubuntu 18.04 ... the new systemd-resolved service).

I stopped all machines, updated Docker, restarted the machines, and started them all again.

Problem: the same problem as described in this post. The ping was received by the target host's operating system but not forwarded to the container.

Solution: (1) stop all containers and docker, (2) consul leave, (3) clean up all entries in the consul key store on the other machines (they were not deleted by the leave), (4) start consul, (5) restart all engines, (6) restart the nginx containers ... gotcha, the network is working now.
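Roughly, the consul part can be done like this (a sketch, not the exact commands we used; it assumes the default --cluster-store=consul://127.0.0.1:8500 setup, and the key prefix docker/network/v1.0/ is the libnetwork default, which may differ in other configurations):

consul leave                                                   # step (2): take the local agent out of the cluster
curl "http://127.0.0.1:8500/v1/kv/docker/network/v1.0/?keys"   # step (3), on the other machines: list what is still stored
curl -X DELETE "http://127.0.0.1:8500/v1/kv/<stale key from the listing above>"   # remove the leftover entries one by one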

Upvotes: 1

Views: 1908

Answers (2)

Artem Sytnyk

Reputation: 1

I faced the exact same issue with an overlay-network Docker Swarm setup. I found that it is not an OS or Docker problem. The affected servers use Intel X-series NICs; other servers with I-series NICs work fine. Do you use on-premise servers, or a cloud provider? We use OVH, and it might be caused by some datacenter network misconfiguration.
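If it really is NIC-related, one cheap thing to check (pure speculation on my side, not verified) is whether hardware checksum offload is mangling the outer UDP checksum of the VXLAN packets; ens3 below is just the interface name from the question:

sudo ethtool -k ens3 | grep checksum   # show the current offload settings
sudo ethtool -K ens3 tx off rx off     # temporarily disable tx/rx checksum offload to test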

Upvotes: 0

Andre Baresel

Reputation: 21

Once again the same problem hit us. We have 7 servers (each running Docker as described above) and two nginx entry points.

It looks like some errors within the consul key store are the real problem, causing the Docker network to show the strange behaviour described above.

In our configuration all 7 servers have their own local consul instance, which synchronises with the others. For the network setup, each Docker engine does a lookup in its local consul key store.
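For context, the engines are wired to consul roughly like this in /etc/docker/daemon.json (a sketch; the address and the interface name are placeholders, not our exact values):

# /etc/docker/daemon.json (equivalent to dockerd --cluster-store=... --cluster-advertise=...)
{
  "cluster-store": "consul://127.0.0.1:8500",
  "cluster-advertise": "ens3:2376"
}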

Last week we noticed that, at the same time as the network reachability problems, the consul clients also reported synchronisation problems (leader election problems, repeats, etc.).
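A few quick checks that make this kind of consul trouble visible (assuming a local agent on the default ports):

consul members                     # do all 7 nodes show up as "alive"?
consul operator raft list-peers    # is there exactly one leader?
consul monitor -log-level=warn     # watch leader-election / sync warnings live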

The final solution was to stop the Docker engines and the consul clients, delete the consul database on some servers, join them to the others again, and start the Docker engines.

It looks like the consul service is a critical part of the network configuration...

In progress...

Upvotes: 0
