I am trying to install the latest version of NVIDIA Clara Deploy Bootstrap following the official documentation (this & this). One step of the installation uses a shell script named "bootstrap.sh", which is meant to install all the dependencies, including Kubernetes and kubectl, and to create the cluster. But when I run sudo ./bootstrap.sh, I get this error: error: the server doesn't have a resource type "pods".
What I have done so far:
I am fairly new to Kubernetes, so I tried the solution from this answer. kubectl get pods returns No resources found, and kubectl auth can-i get pods returns yes. According to that answer, /etc/kubernetes/manifests should contain the config files, but it was empty, so I ran sudo kubeadm init.
Here is the full error message:
2020-10-17 20:57:37 [INFO]: Clara Deploy SDK System Prerequisites Installation
2020-10-17 20:57:37 [INFO]: Checking user privilege...
2020-10-17 20:57:37 [INFO]: Checking for NVIDIA GPU driver...
2020-10-17 20:57:37 [INFO]: NVIDIA CUDA driver version found: 418.87.01
2020-10-17 20:57:37 [INFO]: NVIDIA GPU driver found
2020-10-17 20:57:37 [INFO]: Check and install required packages: apt-transport-https ca-certificates curl software-properties-common network-manager unzip lsb-release dirmngr jq ...
Ign:1 http://deb.debian.org/debian stretch InRelease
Get:2 http://security.debian.org stretch/updates InRelease [53.0 kB]
Get:3 http://deb.debian.org/debian stretch-updates InRelease [93.6 kB]
Get:4 http://deb.debian.org/debian stretch-backports InRelease [91.8 kB]
Hit:5 http://deb.debian.org/debian stretch Release
Hit:6 http://packages.cloud.google.com/apt gcsfuse-stretch InRelease
Get:7 https://download.docker.com/linux/debian stretch InRelease [44.8 kB]
Get:8 http://packages.cloud.google.com/apt cloud-sdk-stretch InRelease [6,389 B]
Get:9 http://security.debian.org stretch/updates/main Sources [263 kB]
Hit:10 http://packages.cloud.google.com/apt google-compute-engine-stretch-stable InRelease
Get:11 http://security.debian.org stretch/updates/main amd64 Packages [604 kB]
Get:12 http://security.debian.org stretch/updates/main Translation-en [267 kB]
Hit:13 http://packages.cloud.google.com/apt google-cloud-packages-archive-keyring-stretch InRelease
Hit:14 https://nvidia.github.io/libnvidia-container/stable/debian9/amd64 InRelease
Hit:16 https://nvidia.github.io/nvidia-container-runtime/stable/debian9/amd64 InRelease
Hit:15 https://packages.cloud.google.com/apt kubernetes-xenial InRelease
Hit:18 https://nvidia.github.io/nvidia-docker/debian9/amd64 InRelease
Fetched 1,424 kB in 1s (1,175 kB/s)
Reading package lists... Done
Reading package lists... Done
Building dependency tree
Reading state information... Done
apt-transport-https is already the newest version (1.4.10).
ca-certificates is already the newest version (20200601~deb9u1).
dirmngr is already the newest version (2.1.18-8~deb9u4).
jq is already the newest version (1.5+dfsg-1.3).
lsb-release is already the newest version (9.20161125).
network-manager is already the newest version (1.6.2-3+deb9u2).
unzip is already the newest version (6.0-21+deb9u2).
curl is already the newest version (7.52.1-5+deb9u12).
software-properties-common is already the newest version (0.96.20.2-1+deb9u1).
0 upgraded, 0 newly installed, 0 to remove and 22 not upgraded.
2020-10-17 20:57:41 [INFO]: Starting network-manager service...
2020-10-17 20:57:41 [INFO]: Successfully installed required packages: apt-transport-https ca-certificates curl software-properties-common network-manager unzip lsb-release dirmngr jq !
2020-10-17 20:57:41 [INFO]: Disabling swap ...
2020-10-17 20:57:41 [INFO]: Start installing docker and nvidia-docker2 ...
2020-10-17 20:57:41 [INFO]: 'proteeti_prova' is already added to docker group. Skipping docker group configuration ...
2020-10-17 20:57:41 [INFO]: Skipping nvidia-docker install since it is already present.
WARNING: No swap limit support
2020-10-17 20:57:42 [INFO]: Docker Compose version 1.25.4 is already installed. Skipping docker-compose installation...
2020-10-17 20:57:42 [INFO]: The following versions of k8s components are already installed.
Error from server (NotFound): the server could not find the requested resource
2020-10-17 20:57:43 [INFO]: - kubectl: Client Version: v1.15.4
2020-10-17 20:57:43 [INFO]: - kubelet: Kubernetes v1.15.4
2020-10-17 20:57:44 [INFO]: - kubeadm: v1.15.4
2020-10-17 20:57:45 [INFO]: Skipping Kubernetes installation (version: 1.15.4-00) since Kubernetes is already present.
error: the server doesn't have a resource type "pods"
Upvotes: 1
Views: 6594
Reputation: 8411
1. Instance:
GCP, Ubuntu 18.04
n1-standard-16 (16 vCPUs, 60 GB memory)
1 x NVIDIA Tesla T4
2. Downloading bootstrap, unpacking:
$curl -LO https://api.ngc.nvidia.com/v2/resources/nvidia/clara/clara_bootstrap/versions/0.7.1-2008.1/files/bootstrap.zip
$unzip bootstrap.zip -d bootstrap
3. Installing cuda as a prerequisite and reboot:
$wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
$sudo mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600
$wget https://developer.download.nvidia.com/compute/cuda/11.1.0/local_installers/cuda-repo-ubuntu1804-11-1-local_11.1.0-455.23.05-1_amd64.deb
$sudo dpkg -i cuda-repo-ubuntu1804-11-1-local_11.1.0-455.23.05-1_amd64.deb
$sudo apt-key add /var/cuda-repo-ubuntu1804-11-1-local/7fa2af80.pub
$sudo apt-get update
$sudo apt-get -y install cuda
$sudo reboot
4. Enable IP Forwarding after reboot:
$sudo -s
#echo 1 > /proc/sys/net/ipv4/ip_forward
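Note that writing to /proc only lasts until the next reboot. Here is a minimal sketch for checking the value and persisting it, assuming a distribution that reads drop-ins from /etc/sysctl.d. The file name 99-ip-forward.conf and both helper names are my own choices, not something from the Clara docs or bootstrap.sh:

```shell
#!/bin/sh
# Hypothetical helpers (names are illustrative, not part of bootstrap.sh).

# Succeeds if the given file (default: the live kernel setting) contains "1".
ip_forward_enabled() {
    file="${1:-/proc/sys/net/ipv4/ip_forward}"
    [ "$(cat "$file" 2>/dev/null)" = "1" ]
}

# Writes a sysctl drop-in so the setting survives reboots.
# The default target path is an assumption; pass another path to override it.
persist_ip_forward() {
    conf="${1:-/etc/sysctl.d/99-ip-forward.conf}"
    printf 'net.ipv4.ip_forward = 1\n' > "$conf"
}
```

Run persist_ip_forward as root and then sysctl --system to load the drop-in; after that ip_forward_enabled should succeed.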
5. Running bootstrap.sh for the first time. kubelet.service fails with code=exited, status=255:
$sudo ./bootstrap/bootstrap.sh
...
...
● kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
Drop-In: /etc/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: activating (auto-restart) (Result: exit-code) since Mon 2020-10-19 10:40:54 UTC; 2s ago
Docs: https://kubernetes.io/docs/home/
Process: 2356 ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS (code=exited, status=255)
Main PID: 2356 (code=exited, status=255)
This status means the cluster has not been initialized yet, so you should run kubeadm init manually. Run sudo kubeadm init --pod-network-cidr=10.244.0.0/16, then check sudo service kubelet status again to make sure kubelet is running as expected. kubeadm init generates all the Kubernetes configs for you.
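If you are not sure whether kubeadm init has already been run on a machine, a rough check is to look for two of the files it writes (both appear in the kubeadm output in the next step). This is only a sketch; the function name and the optional root-prefix argument are hypothetical:

```shell
#!/bin/sh
# Succeeds only if the files that kubeadm init generates are present.
# Optional argument: a root prefix (handy for testing); default is the real root.
kubeadm_initialized() {
    root="${1:-}"
    [ -f "$root/etc/kubernetes/admin.conf" ] &&
    [ -f "$root/var/lib/kubelet/config.yaml" ]
}
```

Run it as root, since these paths may not be readable by a regular user, for example: kubeadm_initialized || kubeadm init --pod-network-cidr=10.244.0.0/16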
6. We pass --pod-network-cidr=10.244.0.0/16 because we will use the Flannel CNI, which uses this range by default. bootstrap.sh does the same on line 334: if ! sudo kubeadm init --pod-network-cidr="10.244.0.0/16"; then
$ sudo kubeadm init --pod-network-cidr=10.244.0.0/16
[init] Using Kubernetes version: v1.15.12
[preflight] Pulling images required for setting up a Kubernetes cluster
...
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Activating the kubelet service
...
[kubeconfig] Using kubeconfig folder "/etc/kubernetes"
[kubeconfig] Writing "admin.conf" kubeconfig file
[kubeconfig] Writing "kubelet.conf" kubeconfig file
[kubeconfig] Writing "controller-manager.conf" kubeconfig file
[kubeconfig] Writing "scheduler.conf" kubeconfig file
...
[apiclient] All control plane components are healthy after 19.501975 seconds
...
Your Kubernetes control-plane has initialized successfully!
...
$ sudo service kubelet status
● kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
Drop-In: /etc/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: active (running) since Mon 2020-10-19 13:42:22 UTC; 4min 15s ago
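The CIDR passed to kubeadm init only works if it matches the network configured in the Flannel manifest. Here is a rough sketch to check that, assuming the kube-flannel.yml shipped in bootstrap.zip contains the same net-conf.json block with a "Network" key as the upstream manifest (I have not verified this for every version). The function names are made up:

```shell
#!/bin/sh
# Prints the first "Network" value found in a Flannel manifest.
flannel_network() {
    sed -n 's/.*"Network": *"\([^"]*\)".*/\1/p' "$1" | head -n 1
}

# Succeeds if the manifest's network equals the CIDR given as the second argument.
cidr_matches_flannel() {
    [ "$(flannel_network "$1")" = "$2" ]
}
```

For example: cidr_matches_flannel bootstrap/kube-flannel.yml 10.244.0.0/16 && echo "CIDR matches"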
7. Next is the usual step that lets you run kubectl commands as your regular user instead of root:
$mkdir -p $HOME/.kube
$sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
$sudo chown $(id -u):$(id -g) $HOME/.kube/config
8. Show everything currently installed
$ kubectl get all -A
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system pod/coredns-5c98db65d4-cpz4s 0/1 Pending 0 4m17s
kube-system pod/coredns-5c98db65d4-kgzg8 0/1 Pending 0 4m17s
kube-system pod/etcd-clara 1/1 Running 0 3m10s
kube-system pod/kube-apiserver-clara 1/1 Running 0 3m35s
kube-system pod/kube-controller-manager-clara 1/1 Running 0 3m17s
kube-system pod/kube-proxy-8qx4z 1/1 Running 0 4m18s
kube-system pod/kube-scheduler-clara 1/1 Running 0 3m23s
NAMESPACE NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
default service/kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 4m35s
kube-system service/kube-dns ClusterIP 10.96.0.10 <none> 53/UDP,53/TCP,9153/TCP 4m34s
NAMESPACE NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
kube-system daemonset.apps/kube-proxy 1 1 1 1 1 beta.kubernetes.io/os=linux 4m33s
NAMESPACE NAME READY UP-TO-DATE AVAILABLE AGE
kube-system deployment.apps/coredns 0/2 2 0 4m34s
NAMESPACE NAME DESIRED CURRENT READY AGE
kube-system replicaset.apps/coredns-5c98db65d4 2 2 0 4m18s
Note that the coredns pods are currently in the Pending state, and that the coredns deployment and replicaset are not ready:
NAMESPACE NAME READY UP-TO-DATE AVAILABLE AGE
kube-system deployment.apps/coredns 0/2 2 0 4m34s
NAMESPACE NAME DESIRED CURRENT READY AGE
kube-system replicaset.apps/coredns-5c98db65d4 2 2 0 4m18s
They are waiting for you to apply the Flannel configuration YAML. These are the relevant lines from the same script:
info "Deploy kubernetes pod network."
sudo kubectl apply -f $SCRIPT_DIR/kube-flannel.yml
sudo kubectl apply -f $SCRIPT_DIR/kube-flannel-rbac.yml
If you skip this and rerun the script at this point, you will get a timeout error:
2020-10-19 14:14:03 [INFO]: coredns pods are not running yet ...
9. Deploy Flannel
$ kubectl apply -f bootstrap/kube-flannel.yml
podsecuritypolicy.extensions/psp.flannel.unprivileged created
clusterrole.rbac.authorization.k8s.io/flannel created
clusterrolebinding.rbac.authorization.k8s.io/flannel created
serviceaccount/flannel created
configmap/kube-flannel-cfg created
daemonset.extensions/kube-flannel-ds-amd64 created
daemonset.extensions/kube-flannel-ds-arm64 created
daemonset.extensions/kube-flannel-ds-arm created
daemonset.extensions/kube-flannel-ds-ppc64le created
daemonset.extensions/kube-flannel-ds-s390x created
$ kubectl apply -f bootstrap/kube-flannel-rbac.yml
clusterrole.rbac.authorization.k8s.io/flannel configured
clusterrolebinding.rbac.authorization.k8s.io/flannel unchanged
Immediately after that, everything related to coredns starts to work: the pods are created and reach the Running state, and the deployment and replicaset become ready.
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system pod/coredns-5c98db65d4-cpz4s 1/1 Running 0 21m
kube-system pod/coredns-5c98db65d4-kgzg8 1/1 Running 0 21m
NAMESPACE NAME READY UP-TO-DATE AVAILABLE AGE
kube-system deployment.apps/coredns 2/2 2 2 21m
NAMESPACE NAME DESIRED CURRENT READY AGE
kube-system replicaset.apps/coredns-5c98db65d4 2 2 2 21m
In addition, you will see a new Flannel pod and the Flannel daemonsets:
kube-system pod/kube-flannel-ds-amd64-64jbv 1/1 Running 0 3m59s
NAMESPACE NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
kube-system daemonset.apps/kube-flannel-ds-amd64 1 1 1 1 1 beta.kubernetes.io/arch=amd64 3m59s
kube-system daemonset.apps/kube-flannel-ds-arm 0 0 0 0 0 beta.kubernetes.io/arch=arm 3m59s
kube-system daemonset.apps/kube-flannel-ds-arm64 0 0 0 0 0 beta.kubernetes.io/arch=arm64 3m59s
kube-system daemonset.apps/kube-flannel-ds-ppc64le 0 0 0 0 0 beta.kubernetes.io/arch=ppc64le 3m59s
kube-system daemonset.apps/kube-flannel-ds-s390x 0 0 0 0 0 beta.kubernetes.io/arch=s390x 3m59s
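Instead of rerunning kubectl get by hand, you could poll until everything in kube-system is Running before you continue. This is only a sketch: the function names, the retry count and the delay are arbitrary choices of mine, and it looks at the STATUS column only, not at READY:

```shell
#!/bin/sh
# Reads `kubectl get pods --no-headers` output (single namespace, so STATUS
# is the third column) on stdin and prints how many pods are not Running.
count_not_running() {
    awk '$3 != "Running" { n++ } END { print n + 0 }'
}

# Polls kube-system until every pod is Running, or gives up.
# Arguments: attempts (default 30), delay in seconds (default 10).
wait_for_kube_system() {
    attempts="${1:-30}"
    delay="${2:-10}"
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # A failed or empty kubectl call counts as "not ready yet".
        if out=$(kubectl get pods -n kube-system --no-headers 2>/dev/null) &&
           [ -n "$out" ] &&
           [ "$(printf '%s\n' "$out" | count_not_running)" -eq 0 ]; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}
```

For example: wait_for_kube_system 30 10 && sudo ./bootstrap/bootstrap.sh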
10. Finally, it's time to continue running the script. It will try to install helm and tiller and to restart dockerd. Everything works except Tiller:
$sudo ./bootstrap/bootstrap.sh
[INFO]: Clara Deploy SDK System Prerequisites Installation
...
Skipping Kubernetes installation (version: 1.15.4-00) since Kubernetes is already present.
./bootstrap/bootstrap.sh: line 412: helm: command not found
...
[INFO]: Start installing helm ...
...
[INFO]: Restarting dockerd...
The connection to the server *.*.*.*:6443 was refused - did you specify the right host or port?
[INFO]: Waiting for Kubernetes to be ready...
Kubernetes master is running at https://*.*.*.*:6443
KubeDNS is running at https://*.*.*.*:6443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
...
[INFO]: Updating permissions...
[INFO]: tiller pod is not started yet ...
[INFO]: tiller pod is not started yet ...
[INFO]: tiller pod is not started yet ...
11. There is no Tiller pod, so the deployment and replicaset are broken as well:
kube-system deployment.apps/tiller-deploy 0/1 0 0 7m26s
kube-system replicaset.apps/tiller-deploy-659c6788f5 1 0 0 7m26s
I don't see any solution other than manually deleting Tiller's components (deployment, service) and reinstalling them from scratch, with small workarounds:
#delete tiller
$kubectl delete deployment tiller-deploy -n kube-system
$kubectl delete service tiller-deploy -n kube-system
#install helm,tiller
$curl https://raw.githubusercontent.com/helm/helm/master/scripts/get | bash
$kubectl create serviceaccount --namespace kube-system tiller
$kubectl create clusterrolebinding tiller-cluster-rule --clusterrole=cluster-admin --serviceaccount=kube-system:tiller
$helm init --service-account tiller
If you now check what has been deployed, you will see that the tiller pod is in the Pending state and that the tiller-deploy deployment is not ready:
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system pod/tiller-deploy-67847cd9b9-vlzm6 0/1 Pending 0 11m
NAMESPACE NAME READY UP-TO-DATE AVAILABLE AGE
kube-system deployment.apps/tiller-deploy 0/1 1 0 11m
NAMESPACE NAME DESIRED CURRENT READY AGE
kube-system replicaset.apps/tiller-deploy-67847cd9b9 1 1 0 11m
12. Fixing Tiller
Let's describe the tiller pod and look at its tolerations:
$ kubectl describe pod tiller-deploy-67847cd9b9-vlzm6 -n kube-system
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
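The pod only tolerates the not-ready and unreachable taints, not the node-role.kubernetes.io/master:NoSchedule taint that kubeadm puts on the master node, so on a single-node cluster it has nowhere to run (read up on taints and tolerations for the details). The fix is to allow the master to run pods by removing that taint: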
$kubectl taint nodes --all node-role.kubernetes.io/master-
After that you will see
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system pod/tiller-deploy-67847cd9b9-vlzm6 1/1 Running 0 13m
NAMESPACE NAME READY UP-TO-DATE AVAILABLE AGE
kube-system deployment.apps/tiller-deploy 1/1 1 1 13m
NAMESPACE NAME DESIRED CURRENT READY AGE
kube-system replicaset.apps/tiller-deploy-67847cd9b9 1 1 1 13m
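If you want to confirm whether the master taint is still present (before or after removing it), something like the following should work. It is a sketch with a made-up function name; it simply looks for the taint key in a whitespace-separated list read from stdin:

```shell
#!/bin/sh
# Reads taint keys (whitespace-separated) on stdin and succeeds
# if the kubeadm master taint key is among them.
has_master_taint() {
    tr ' \t' '\n\n' | grep -qx 'node-role.kubernetes.io/master'
}
```

Feed it the taint keys of your nodes, for example: kubectl get nodes -o jsonpath='{.items[*].spec.taints[*].key}' | has_master_taint && echo "master is still tainted"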
13. Next, installing all components:
$curl -LO https://api.ngc.nvidia.com/v2/resources/nvidia/clara/clara_cli/versions/0.7.1-2008.1/files/cli.zip
$sudo unzip cli.zip -d /usr/bin/ && sudo chmod 755 /usr/bin/clara*
$ clara version
Clara CLI version: 0.7.1-12788.ae65aea0
$ clara config --key KEY --orgteam nvidia/clara -y
Configuration "ngc-clara"successfully created
$ clara pull platform
Clara Platform 0.7.1-2008.1
Chart saved at: /home/YOUR_USER/.clara/charts/clara
$ clara platform start
Starting clara...
NAME: clara
$ clara pull dicom
Clara Dicom Adapter 0.7.1-2008.1
Chart saved at: /home/YOUR_USER/.clara/charts/dicom-adapter
$ clara pull render
Clara Renderer 0.7.1-2008.1
Chart saved at: /home/YOUR_USER/.clara/charts/clara-renderer
$ clara pull monitor
Clara Monitor Server 0.7.1-2008.1
Chart saved at: /home/YOUR_USER/.clara/charts/clara-monitor-server
$ clara pull console
Clara Management Console 0.7.1-2008.1
Chart saved at: /home/YOUR_USER/.clara/charts/clara-console
$ clara dicom start
Starting DICOM Adapter...
NAME: clara-dicom-adapter
$ clara render start
NAME: clara-render-server
$ clara monitor start
NAME: clara-monitor-server
$ clara console start
NAME: clara-console
14. To verify that the installation is successful, run the following command:
$ helm ls
NAME REVISION UPDATED STATUS CHART APP VERSION NAMESPACE
clara 1 Mon Oct 19 16:16:36 2020 DEPLOYED clara-0.7.1-2008.1 1.0 default
clara-console 1 Mon Oct 19 16:28:30 2020 DEPLOYED clara-console-0.7.1-2008.1 1.0 default
clara-dicom-adapter 1 Mon Oct 19 16:22:36 2020 DEPLOYED dicom-adapter-0.7.1-2008.1 1.0 default
clara-monitor-server 1 Mon Oct 19 16:26:35 2020 DEPLOYED clara-monitor-server-0.7.1-2008.1 1.0 default
clara-render-server 1 Mon Oct 19 16:22:54 2020 DEPLOYED clara-renderer-0.7.1-2008.1 1.0 default
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
clara-clara-platformapiserver-54c5c44bbd-gqdd6 1/1 Running 0 13m
clara-console-8565b4d565-wcbg5 2/2 Running 0 2m2s
clara-console-mongodb-85f8bd5f95-ts2gp 1/1 Running 0 2m2s
clara-dicom-adapter-7948fcd445-mnsjd 1/1 Running 0 7m56s
clara-monitor-server-fluentd-elasticsearch-6zvhq 1/1 Running 0 3m57s
clara-monitor-server-grafana-5f874b974d-6l4s8 1/1 Running 0 3m57s
clara-monitor-server-monitor-server-59c8bf68f7-5dgxq 1/1 Running 0 3m57s
clara-render-server-clara-renderer-d79dd4779-wcjrv 3/3 Running 0 7m38s
clara-resultsservice-664477898f-9nk4f 1/1 Running 0 13m
clara-ui-6f89b97df8-792f6 1/1 Running 0 13m
clara-workflow-controller-69cbb55fc8-zjhdm 1/1 Running 0 13m
elasticsearch-master-0 1/1 Running 0 3m57s
elasticsearch-master-1 1/1 Running 0 3m57s
fluentd-km8nj 1/1 Running 0 13m
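To check the helm ls output without reading it by eye, a small filter can list any release that is not DEPLOYED. This sketch assumes the Helm v2 output format shown above, with a header row and the release name in the first column; the function name is hypothetical:

```shell
#!/bin/sh
# Reads `helm ls` (Helm v2) output on stdin and prints the names of
# releases whose row does not contain the DEPLOYED status.
undeployed_releases() {
    awk 'NR > 1 && $0 !~ /DEPLOYED/ { print $1 }'
}
```

Usage: helm ls | undeployed_releases. Empty output means every listed release is deployed.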
P.S. Sure, it would have been much easier to just fix the script for you, but I decided to show you what's going on in the background. I'm sure you can do that on your own if needed.