Reputation: 21
I have a single container deployed via docker-compose (DNS is handled by the Docker daemon's embedded DNS server at 127.0.0.11) on a host whose /etc/resolv.conf points to a DNS server on a private network, with no access to the internet.
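To make the setup concrete, the resolver chain can be checked with something like the following (the container name app is a placeholder for the compose service's container):
# On the host: /etc/resolv.conf points at the private-network DNS server
cat /etc/resolv.conf
# Inside the container: compose-managed containers get Docker's embedded DNS server
sudo docker exec app cat /etc/resolv.conf    # expect: nameserver 127.0.0.11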
The container runs fine for a while (about 40 hours) and then starts failing its DNS lookups with timeouts. The application logs show failures against the Docker DNS server:
Caused by: java.net.UnknownHostException: failed to resolve 'alfresco.test.duf'
at io.netty.resolver.dns.DnsResolveContext.finishResolve(DnsResolveContext.java:1013)
at io.netty.resolver.dns.DnsResolveContext.tryToFinishResolve(DnsResolveContext.java:966)
at io.netty.resolver.dns.DnsResolveContext.query(DnsResolveContext.java:414)
at io.netty.resolver.dns.DnsResolveContext.access$600(DnsResolveContext.java:63)
at io.netty.resolver.dns.DnsResolveContext$2.operationComplete(DnsResolveContext.java:463)
at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:578)
at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:571)
at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:550)
at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:491)
at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:616)
at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:609)
at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117)
at io.netty.resolver.dns.DnsQueryContext.tryFailure(DnsQueryContext.java:225)
at io.netty.resolver.dns.DnsQueryContext$4.run(DnsQueryContext.java:177)
at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: io.netty.resolver.dns.DnsNameResolverTimeoutException: [/127.0.0.11:53] query via UDP timed out after 5000 milliseconds (no stack trace available)
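When this happens, the same lookup can be reproduced outside the JVM to rule out Netty itself. A sketch, assuming the container is named app and its image ships getent (glibc) or nslookup:
# getent reads /etc/resolv.conf and queries 127.0.0.11, much like the Netty resolver does directly
sudo docker exec app getent hosts alfresco.test.duf
# If nslookup is available in the image, query the embedded and the upstream DNS servers explicitly
sudo docker exec app nslookup alfresco.test.duf 127.0.0.11
sudo docker exec app nslookup alfresco.test.duf 157.164.138.33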
The Docker daemon log shows failures against the DNS server on the local network:
Aug 25 12:19:15 st2510v dockerd[6749]: time="2021-08-25T12:19:15.066556867+02:00" level=warning msg="[resolver] connect failed: dial udp 157.164.138.33:53: connect: resource temporarily unavailable"
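"connect: resource temporarily unavailable" is an EAGAIN from the kernel on the UDP dial, which usually points at local resource exhaustion (sockets, ephemeral ports or file descriptors) in the namespace doing the dial rather than at the upstream DNS server itself. A few checks along these lines can narrow it down (the container name app is again a placeholder):
# File descriptors held by the Docker daemon
sudo ls /proc/$(pidof dockerd)/fd | wc -l
# System-wide file handle usage: allocated, free, maximum
cat /proc/sys/fs/file-nr
# Socket summary inside the container's network namespace
CPID=$(sudo docker inspect -f '{{.State.Pid}}' app)
sudo nsenter -t "$CPID" -n ss -s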
Pinging the target server from the Docker host resolves correctly.
Starting a bash container in the Docker network created by compose and pinging the target server from there also resolves correctly (a sketch of this test is given below).
Pinging any server (the external DNS, the Docker DNS, the bash container) from within the problematic container fails to resolve.
The container does not recover from the error on its own.
Restarting or recreating the container does fix the issue.
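The throwaway test container mentioned above can be started roughly like this (network and image names are assumptions; compose usually names the network <project>_default):
# Attach a disposable container to the compose network and test resolution from there
sudo docker run --rm -it --network myproject_default busybox ping -c 3 alfresco.test.duf
sudo docker run --rm --network myproject_default busybox nslookup alfresco.test.duf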
I've compared the host's iptables rules and network interfaces with those of a working instance that does not have the issue at all, but this did not reveal any significant differences.
Any advice on what the issue is, or how to diagnose what it might be?
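For reference, a comparison like that can be done with commands along these lines (a sketch; rules and interface names differ per host):
# NAT and filter rules that Docker maintains
sudo iptables-save | grep -i docker
# Bridges and interfaces (docker0 plus one bridge per compose network)
ip -br link
ip -br addr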
Docker version output:
[al6735@st2510v ~]$ sudo docker version
Client: Docker Engine - Community
 Version: 19.03.5
 API version: 1.40
 Go version: go1.12.12
 Git commit: 633a0ea
 Built: Wed Nov 13 07:25:41 2019
 OS/Arch: linux/amd64
 Experimental: false

Server: Docker Engine - Community
 Engine:
  Version: 19.03.5
  API version: 1.40 (minimum version 1.12)
  Go version: go1.12.12
  Git commit: 633a0ea
  Built: Wed Nov 13 07:24:18 2019
  OS/Arch: linux/amd64
  Experimental: false
 containerd:
  Version: 1.2.13
  GitCommit: 7ad184331fa3e55e52b890ea95e65ba581ae3429
 runc:
  Version: 1.0.0-rc10
  GitCommit: dc9208a3303feef5b3839f4323d9beb36df0a9dd
 docker-init:
  Version: 0.18.0
  GitCommit: fec3683
Docker info output:
[al6735@st2510v ~]$ sudo docker info
Client:
 Debug Mode: false

Server:
 Containers: 2
  Running: 2
  Paused: 0
  Stopped: 0
 Images: 3
 Server Version: 19.03.5
 Storage Driver: overlay2
  Backing Filesystem: xfs
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 7ad184331fa3e55e52b890ea95e65ba581ae3429
 runc version: dc9208a3303feef5b3839f4323d9beb36df0a9dd
 init version: fec3683
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 3.10.0-957.21.2.el7.x86_64
 Operating System: CentOS Linux 7 (Core)
 OSType: linux
 Architecture: x86_64
 CPUs: 4
 Total Memory: 15.51GiB
 Name: st2510v
 ID: KTEE:M3ZD:5ZS5:DVFU:R6VJ:YV7Q:QPP5:D4YG:ITV7:YC3U:YP3J:AEDG
 Docker Root Dir: /home/docker
 Debug Mode: true
  File Descriptors: 38
  Goroutines: 48
  System Time: 2021-09-24T14:23:42.314595155+02:00
  EventsListeners: 0
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
Upvotes: 0
Views: 1424
Reputation: 21
Further inspection of the host showed that the Java application in the target container was holding a large number of open TCP sockets.
After fixing that socket leak, the connection issue did not occur any more. Presumably we were hitting a limit on the number of open sockets a container can have.
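For anyone hitting the same symptom: the socket build-up can be spotted from the host before it becomes fatal. A sketch, with the container name app as a placeholder:
# PID of the container's main process as seen from the host
CPID=$(sudo docker inspect -f '{{.State.Pid}}' app)
# Open file descriptors (each socket is one) versus the per-process limit
sudo ls /proc/"$CPID"/fd | wc -l
sudo grep 'open files' /proc/"$CPID"/limits
# TCP sockets inside the container's network namespace, grouped by state
sudo nsenter -t "$CPID" -n ss -tan | awk 'NR>1 {print $1}' | sort | uniq -c
If the descriptor count keeps climbing towards the limit, raising nofile (docker-compose supports a ulimits: nofile: setting) only postpones the failure; fixing the leak in the application is the actual cure.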
Upvotes: 1