Reputation: 5149
I have a K8s cluster that was working properly but because of power failure, all the nodes got rebooted.
At the moment I have some problem recovering the master (and other nodes):
sudo systemctl kubelet status
returns Unknown operation kubelet.
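(Presumably this is just the argument order: systemctl expects the subcommand before the unit name, so the status check would normally be written as:
sudo systemctl status kubelet
which is likely why the form above is reported as an unknown operation.)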
When I run kubeadm init ...
(the command I set up the cluster with) it returns:
error execution phase preflight: [preflight] Some fatal errors occurred:
[ERROR Port-6443]: Port 6443 is in use
[ERROR Port-10251]: Port 10251 is in use
[ERROR Port-10252]: Port 10252 is in use
[ERROR FileAvailable--etc-kubernetes-manifests-kube-apiserver.yaml]: /etc/kubernetes/manifests/kube-apiserver.yaml already exists
[ERROR FileAvailable--etc-kubernetes-manifests-kube-controller-manager.yaml]: /etc/kubernetes/manifests/kube-controller-manager.yaml already exists
[ERROR FileAvailable--etc-kubernetes-manifests-kube-scheduler.yaml]: /etc/kubernetes/manifests/kube-scheduler.yaml already exists
[ERROR FileAvailable--etc-kubernetes-manifests-etcd.yaml]: /etc/kubernetes/manifests/etcd.yaml already exists
[ERROR Port-10250]: Port 10250 is in use
[ERROR Port-2379]: Port 2379 is in use
[ERROR Port-2380]: Port 2380 is in use
[ERROR DirAvailable--var-lib-etcd]: /var/lib/etcd is not empty
When I checked those ports, I could see that kubelet and other K8s components are using them:
~/k8s-multi-node$ sudo lsof -i :10251
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
kube-sche 26292 root 3u IPv6 104933 0t0 TCP *:10251 (LISTEN)
~/k8s-multi-node$ sudo lsof -i :10252
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
kube-cont 26256 root 3u IPv6 115541 0t0 TCP *:10252 (LISTEN)
~/k8s-multi-node$ sudo lsof -i :10250
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
kubelet 24781 root 27u IPv6 106821 0t0 TCP *:10250 (LISTEN)
I tried to kill them, but they start using those ports again.
So what is the proper way to recover such a cluster? Do I need to remove kubelet and all the other components and install them again?
Upvotes: 5
Views: 9527
Reputation: 44559
You first need to stop the kubelet using sudo systemctl stop kubelet.service.
After that, run kubeadm reset and then kubeadm init.
Note that this will clean up the existing cluster state and create a new cluster altogether.
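For reference, the full sequence on the master would look roughly like this (the kubeadm init flags are whatever you originally used, so they are not shown here):
sudo systemctl stop kubelet.service       # stop kubelet so it no longer restarts the static control-plane pods
sudo kubeadm reset                        # cleans up /etc/kubernetes/manifests, /var/lib/etcd and other cluster state
sudo kubeadm init <your original flags>   # re-creates the control plane from scratch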
Regarding the proper way to recover, check this question.
Upvotes: 19