Ravi Reddy

Reputation: 186

Some Kubernetes pods die after running for a day

We are trying a test setup with Kubernetes version 1.0.6 on AWS.

The setup involves pods for Cassandra (2 nodes), Spark (master, 2 workers, driver), and RabbitMQ (1 node). Some of the pods in this setup die after a day or so.

Is there a way to get logs from Kubernetes on how/why they died?

When you try to restart the dead pods manually, some pods show a status like 'category/spark-worker is ready, container is creating' and the pod start never completes.

The only option in this scenario is to run "kube-down.sh and then kube-up.sh" and go through the entire setup from scratch.

Upvotes: 2

Views: 6198

Answers (2)

Adam Romanek

Reputation: 1879

Your nodes have probably run out of disk space due to an issue in Kubernetes.

An indirect fix is available in the just-released Kubernetes v1.0.7:

AWS: Create one storage pool for aufs, not two #13803 (justinsb)

However, as described in the above-mentioned issue, there is still some work to do in this area.
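If you want to confirm that disk space is the problem, a quick check (assuming you can SSH into the affected node; the commands below are a generic sketch, not specific to this issue) is:

    # On the affected node:
    df -h                   # look for a filesystem at or near 100%, e.g. the Docker storage volume
    sudo docker images -a   # accumulated image layers are a common source of disk pressure
    sudo docker ps -a       # exited containers that were never cleaned up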

Upvotes: 1

Yu-Ju Hong

Reputation: 7287

kubectl describe pod ${POD_NAME} or kubectl logs ${POD_NAME} ${CONTAINER_NAME} should give you more information for debugging.
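For example, with a hypothetical pod name taken from kubectl get pods (substitute your own pod and container names):

    kubectl get pods                                  # find the exact pod name
    kubectl describe pod spark-worker-controller-xyz  # events, restart counts, container states
    kubectl logs spark-worker-controller-xyz spark-worker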

Please also see https://github.com/kubernetes/kubernetes/blob/master/docs/user-guide/application-troubleshooting.md#debugging-pods for general troubleshooting instructions.

EDIT:

After the discussion in the comments, I think the problem is that the node was unresponsive for more than 5 minutes (potentially due to high memory usage of influxdb). The node controller then deemed the node not ready and evicted all pods on the node. Note that pods managed by replication controllers would be re-created (with a different name), but pods created manually would not be.
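One way to check this theory after the fact (the node name below is a placeholder):

    kubectl get nodes                   # a node stuck in NotReady points to the eviction path
    kubectl describe node <node_name>   # conditions and recent events for that node
    kubectl get pods                    # RC-managed pods come back with new names; manually created pods just disappear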

If you suspect influxdb memory usage is the root cause, you can try not running this pod to see if the problem resolves itself. Alternatively, you can change the memory limit of the influxdb container to a smaller value.
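A rough sketch of lowering the limit (the controller name, pod name, and file name are hypothetical, and the monitoring add-on normally lives in the kube-system namespace; check kubectl get rc --namespace=kube-system for the real names):

    # Export the influxdb controller spec:
    kubectl get rc <influxdb_rc_name> --namespace=kube-system -o yaml > influxdb-rc.yaml
    # Edit influxdb-rc.yaml and lower resources.limits.memory for the influxdb container,
    # then recreate the controller and its pod so the new limit takes effect:
    kubectl delete rc <influxdb_rc_name> --namespace=kube-system
    kubectl delete pod <influxdb_pod_name> --namespace=kube-system   # if the old pod is still around
    kubectl create -f influxdb-rc.yaml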

EDIT2:

Some tips for finding out what happened to the node (a combined example follows the list):

  1. Check /var/log/kubelet.log. This is the easiest approach.

  2. kubectl describe nodes, or kubectl get events | grep <node_name> (for older versions of Kubernetes)

These commands give you the events associated with the node status. However, the events are flushed every two hours, so you would need to run them soon after your node encounters the problem.

  3. kubectl get node <node_name> -o yaml --watch lets you monitor the node object, including its status, in yaml. This would be updated periodically.
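Putting the three together (the node name is a placeholder; the grep filter is just an example):

    # 1. On the node itself:
    grep -i error /var/log/kubelet.log | tail -n 50
    # 2. From your workstation, events associated with the node:
    kubectl describe nodes
    kubectl get events | grep <node_name>
    # 3. Watch the node object, including its status, in yaml:
    kubectl get node <node_name> -o yaml --watch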

Upvotes: 2
