Reputation: 5776
I have a Kubernetes cluster that was initialized using the kube-up.sh script on AWS, and occasionally there's a very slow DNS lookup when resolving one service from inside another pod. Here's the basic picture:
(browser)
|
V
(ELB)
|
V
(front-end service)
|
V
(front-end pod)
|
V
(back-end service)
|
V
(back-end pod)
|
V
(database)
I have timing logging installed at the front-end and back-end levels, and their numbers are wildly divergent for some requests. Occasionally we'll see a request that the front-end nginx log reports as taking 8.3 seconds while the back-end gunicorn process reports 30ms.
I can exec into the FE pod and curl the backend endpoint to get timing data according to the example in this blog post (the command is sketched after the output below), and it looks like this:
time_namelookup: 3.513
time_connect: 3.513
time_appconnect: 0.000
time_pretransfer: 3.513
time_redirect: 0.000
time_starttransfer: 3.520
----------
time_total: 3.520
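For reference, those timing numbers come from curl's -w/--write-out formatting. A minimal version of the command, assuming a format file named curl-format.txt and a placeholder backend URL, looks roughly like this:

# contents of curl-format.txt (file name and URL below are placeholders)
    time_namelookup:  %{time_namelookup}\n
       time_connect:  %{time_connect}\n
    time_appconnect:  %{time_appconnect}\n
   time_pretransfer:  %{time_pretransfer}\n
      time_redirect:  %{time_redirect}\n
 time_starttransfer:  %{time_starttransfer}\n
                      ----------\n
         time_total:  %{time_total}\n

# run from inside the FE pod
curl -w "@curl-format.txt" -o /dev/null -s http://backend-service/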
So the slowness seems to be coming from DNS. We have a separate cluster set up for staging, and this sort of thing doesn't seem to be happening there, so I'm not sure what to make of it. Most requests happen in a reasonable amount of time, less than 50ms, but every tenth one or so takes multiple seconds to resolve.
I found this thread that made it sound like SkyDNS's use of etcd might be the problem, but I'm not sure how to verify that or fix it. And this is happening far too often to be explained by the occasional missing configuration value (our traffic isn't that high).
Upvotes: 3
Views: 4862
Reputation: 2682
By default, Kubernetes configures the pods to use both SkyDNS (to resolve service names) and the resolver of the underlying infrastructure (to resolve external names). The resolver library inside the Docker container then sends requests to SkyDNS or the external resolver in a round-robin fashion. It also generates several queries per lookup, first using the full name (e.g. service.namespace.svc.domain) and then trimming it (e.g. service.namespace.svc; service.namespace). This can result in long timeouts if the first request is sent to the wrong server.
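To make that concrete, a pod's /etc/resolv.conf in a kube-up.sh cluster of this era typically looks something like the sketch below; the addresses, namespace and cluster domain are illustrative, not taken from this cluster:

# sketch of /etc/resolv.conf inside a pod (values are placeholders)
nameserver 10.0.0.10          # skydns service IP
nameserver 172.20.0.2         # underlying AWS VPC resolver
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5

# a lookup for a short name like "backend-service" fans out into several
# queries across the search domains, and each attempt may be sent to either
# nameserver, so a cluster-internal name that lands on the VPC resolver has
# to fail or time out before the next attempt is made.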
If you don't care about the external resolver, you can override the resolution behaviour with the kubelet flag "--resolv-conf", which allows you to specify an alternate set of external resolvers (or none).
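As a rough sketch of that override (the file path and its contents are assumptions, not part of the original answer), you would point the kubelet on each node at a resolv.conf of your own and restart it:

# /etc/kubernetes/resolv-override.conf (hypothetical path): list only the
# external resolvers you want pods to inherit, or leave it empty for none
nameserver 172.20.0.2

# pass the file to the kubelet on each node (rest of the command line omitted)
kubelet --resolv-conf=/etc/kubernetes/resolv-override.conf ...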
Upvotes: 2
Reputation: 734
There was a bug, fixed in https://github.com/kubernetes/kubernetes/pull/13345, that has been shown to cause this problem in Kubernetes clusters 1.0.5 and older. The fix is included in the 1.0.6 release.
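A quick way to check whether a cluster falls into the affected range is to look at the server version it reports; the exact output below is illustrative:

kubectl version
# Client Version: version.Info{Major:"1", Minor:"0", GitVersion:"v1.0.6", ...}
# Server Version: version.Info{Major:"1", Minor:"0", GitVersion:"v1.0.5", ...}
# a server GitVersion of v1.0.5 or older is in the affected range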
Upvotes: 4