Robrecht Vanhuysse

Reputation: 21

DNS lookup for Docker container breaks after ~36 hours of uptime

I have a single container deployed via docker-compose (DNS is handled by the Docker daemon's embedded DNS server at 127.0.0.11) on a host whose /etc/resolv.conf points to a DNS server on a private network; the host has no access to the internet.
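For reference, that chain can be sanity-checked with the commands below; <container> is a placeholder for the compose service's container name. On a user-defined (compose) network, Docker injects its embedded resolver as the container's only nameserver and forwards queries to the resolvers listed in the host's /etc/resolv.conf.

    # Inside the container, the embedded resolver should be the only nameserver:
    docker exec <container> cat /etc/resolv.conf    # expect: nameserver 127.0.0.11

    # On the host, the upstream resolvers the embedded server forwards to:
    cat /etc/resolv.conf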

The container runs fine for a while (about 40 hours) and then starts failing its DNS lookups with timeout messages; the application logs show failures against the Docker DNS server:

Caused by: java.net.UnknownHostException: failed to resolve 'alfresco.test.duf'
        at io.netty.resolver.dns.DnsResolveContext.finishResolve(DnsResolveContext.java:1013)
        at io.netty.resolver.dns.DnsResolveContext.tryToFinishResolve(DnsResolveContext.java:966)
        at io.netty.resolver.dns.DnsResolveContext.query(DnsResolveContext.java:414)
        at io.netty.resolver.dns.DnsResolveContext.access$600(DnsResolveContext.java:63)
        at io.netty.resolver.dns.DnsResolveContext$2.operationComplete(DnsResolveContext.java:463)
        at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:578)
        at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:571)
        at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:550)
        at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:491)
        at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:616)
        at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:609)
        at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117)
        at io.netty.resolver.dns.DnsQueryContext.tryFailure(DnsQueryContext.java:225)
        at io.netty.resolver.dns.DnsQueryContext$4.run(DnsQueryContext.java:177)
        at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
        at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170)
        at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
        at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
        at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.base/java.lang.Thread.run(Thread.java:834)
    Caused by: io.netty.resolver.dns.DnsNameResolverTimeoutException: [/127.0.0.11:53] query via UDP timed out after 5000 milliseconds (no stack trace available)
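To narrow down whether the embedded resolver or the upstream server is the one timing out, it can help to query each of them directly from inside the container. A sketch, assuming nslookup (or a similar tool) is present in the image and substituting the container name; the hostname is the one from the stack trace and the upstream address is the one that shows up in the daemon log below:

    # Query the Docker embedded DNS server directly:
    docker exec -it <container> nslookup alfresco.test.duf 127.0.0.11

    # Bypass it and query the host's upstream resolver directly:
    docker exec -it <container> nslookup alfresco.test.duf 157.164.138.33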

The Docker daemon log shows failures against the DNS server on the local network:

Aug 25 12:19:15 st2510v dockerd[6749]: time="2021-08-25T12:19:15.066556867+02:00" level=warning msg="[resolver] connect failed: dial udp 157.164.138.33:53: connect: resource temporarily unavailable"
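"connect: resource temporarily unavailable" is EAGAIN; on a UDP dial it usually points at a resource limit (file descriptors, ephemeral ports) rather than at the upstream server itself. A rough way to check, with <container> substituted:

    # File descriptors held by dockerd, which performs the upstream lookups:
    sudo ls /proc/$(pidof dockerd)/fd | wc -l

    # System-wide file handle usage versus the limit:
    cat /proc/sys/fs/file-nr

    # Socket summary inside the container's network namespace:
    pid=$(docker inspect -f '{{.State.Pid}}' <container>)
    sudo nsenter -t "$pid" -n ss -s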

Pinging the target server from the Docker host resolves correctly.

Starting a bash container on the Docker network (created through compose) and pinging the target server from there also resolves correctly.

Pinging any server (external DNS, Docker DNS, the bash container) from within the problematic container fails to resolve.

The container does not recover from the error on its own.

Restarting or recreating the container does fix the issue.
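Since a restart clears the problem, it can be worth probing the broken container's network namespace with host tools before restarting it, to see whether the namespace itself or only the application inside it is affected. A sketch, assuming nsenter, ping and dig are available on the host and the container name is substituted:

    # Enter only the network namespace of the broken container, using host binaries:
    pid=$(docker inspect -f '{{.State.Pid}}' <container>)
    sudo nsenter -t "$pid" -n ping -c 1 157.164.138.33
    sudo nsenter -t "$pid" -n dig @127.0.0.11 alfresco.test.duf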

I've compared the host's iptables rules and network interfaces with those of a working instance that does not have the issue, but this did not yield any significant differences.
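For reference, that comparison can be kept systematic by capturing equivalent snapshots on both hosts and diffing them:

    # On each host, capture the rule set and the interface/address list:
    sudo iptables-save | sort > /tmp/iptables.$(hostname)
    ip -o addr > /tmp/ipaddr.$(hostname)
    # Copy the files to one machine and diff them.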

Any advice on what the issue is, or how to diagnose what it might be?

Update 1

Docker version output:

[al6735@st2510v ~]$ sudo docker version
Client: Docker Engine - Community
 Version:           19.03.5
 API version:       1.40
 Go version:        go1.12.12
 Git commit:        633a0ea
 Built:             Wed Nov 13 07:25:41 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          19.03.5
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.12.12
  Git commit:       633a0ea
  Built:            Wed Nov 13 07:24:18 2019
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.2.13
  GitCommit:        7ad184331fa3e55e52b890ea95e65ba581ae3429
 runc:
  Version:          1.0.0-rc10
  GitCommit:        dc9208a3303feef5b3839f4323d9beb36df0a9dd
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683

Docker info output:

[al6735@st2510v ~]$ sudo docker info
Client:
 Debug Mode: false

Server:
 Containers: 2
  Running: 2
  Paused: 0
  Stopped: 0
 Images: 3
 Server Version: 19.03.5
 Storage Driver: overlay2
  Backing Filesystem: xfs
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 7ad184331fa3e55e52b890ea95e65ba581ae3429
 runc version: dc9208a3303feef5b3839f4323d9beb36df0a9dd
 init version: fec3683
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 3.10.0-957.21.2.el7.x86_64
 Operating System: CentOS Linux 7 (Core)
 OSType: linux
 Architecture: x86_64
 CPUs: 4
 Total Memory: 15.51GiB
 Name: st2510v
 ID: KTEE:M3ZD:5ZS5:DVFU:R6VJ:YV7Q:QPP5:D4YG:ITV7:YC3U:YP3J:AEDG
 Docker Root Dir: /home/docker
 Debug Mode: true
  File Descriptors: 38
  Goroutines: 48
  System Time: 2021-09-24T14:23:42.314595155+02:00
  EventsListeners: 0
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

Upvotes: 0

Views: 1424

Answers (1)

Robrecht Vanhuysse

Reputation: 21

Further inspection of the host showed that the Java application in the target container was holding a large number of TCP sockets.

After fixing that, the connection issue no longer occurred. Presumably we hit a limit on the number of open sockets a container can have.
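One way to verify the socket build-up (and the limit being hit) from the host, assuming the Java application is the container's main process and substituting the container name:

    # PID of the container's main process (the Java application here):
    pid=$(docker inspect -f '{{.State.Pid}}' <container>)

    # Number of open file descriptors (each socket counts as one):
    sudo ls /proc/"$pid"/fd | wc -l

    # The file-descriptor limit that applies to that process:
    grep 'open files' /proc/"$pid"/limits

If the limit itself is the bottleneck, the ulimits key in docker-compose (or the daemon's default-ulimits setting) can raise nofile for the container, but fixing the leak in the application is the real remedy.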

Upvotes: 1
