Proteeti Prova

Reputation: 1169

"server doesn't have a resource type "pods"" while installing NVIDIA Clara Deploy

I am trying to install the latest version of NVIDIA Clara Deploy Bootstrap following the official documentation (this & this). One step of the installation uses a shell script named "bootstrap.sh", which is meant to install all the dependencies, including Kubernetes and kubectl, and to create the cluster. But when I run sudo ./bootstrap.sh, I get this error: error: the server doesn't have a resource type "pods".

What I have done so far: I am fairly new to Kubernetes. I tried the solution from this answer: kubectl get pods gives me No resources found., and kubectl auth can-i get pods gives me yes. The /etc/kubernetes/manifests directory, which according to that answer should contain the config files, was empty, so I ran sudo kubeadm init.

Here is the full error message:

2020-10-17 20:57:37 [INFO]: Clara Deploy SDK System Prerequisites Installation
2020-10-17 20:57:37 [INFO]: Checking user privilege...
 
2020-10-17 20:57:37 [INFO]: Checking for NVIDIA GPU driver...
2020-10-17 20:57:37 [INFO]: NVIDIA CUDA driver version found: 418.87.01
2020-10-17 20:57:37 [INFO]: NVIDIA GPU driver found
2020-10-17 20:57:37 [INFO]: Check and install required packages: apt-transport-https ca-certificates curl software-properties-common network-manager unzip lsb-release dirmngr jq ...
Ign:1 http://deb.debian.org/debian stretch InRelease
Get:2 http://security.debian.org stretch/updates InRelease [53.0 kB]
Get:3 http://deb.debian.org/debian stretch-updates InRelease [93.6 kB]          
Get:4 http://deb.debian.org/debian stretch-backports InRelease [91.8 kB]               
Hit:5 http://deb.debian.org/debian stretch Release 
Hit:6 http://packages.cloud.google.com/apt gcsfuse-stretch InRelease
Get:7 https://download.docker.com/linux/debian stretch InRelease [44.8 kB]
Get:8 http://packages.cloud.google.com/apt cloud-sdk-stretch InRelease [6,389 B]                                       
Get:9 http://security.debian.org stretch/updates/main Sources [263 kB]                            
Hit:10 http://packages.cloud.google.com/apt google-compute-engine-stretch-stable InRelease             
Get:11 http://security.debian.org stretch/updates/main amd64 Packages [604 kB]                                       
Get:12 http://security.debian.org stretch/updates/main Translation-en [267 kB]                                                 
Hit:13 http://packages.cloud.google.com/apt google-cloud-packages-archive-keyring-stretch InRelease                                   
Hit:14 https://nvidia.github.io/libnvidia-container/stable/debian9/amd64  InRelease            
Hit:16 https://nvidia.github.io/nvidia-container-runtime/stable/debian9/amd64  InRelease
Hit:15 https://packages.cloud.google.com/apt kubernetes-xenial InRelease
Hit:18 https://nvidia.github.io/nvidia-docker/debian9/amd64  InRelease
Fetched 1,424 kB in 1s (1,175 kB/s)
Reading package lists... Done
Reading package lists... Done
Building dependency tree       
Reading state information... Done
apt-transport-https is already the newest version (1.4.10).
ca-certificates is already the newest version (20200601~deb9u1).
dirmngr is already the newest version (2.1.18-8~deb9u4).
jq is already the newest version (1.5+dfsg-1.3).
lsb-release is already the newest version (9.20161125).
network-manager is already the newest version (1.6.2-3+deb9u2).
unzip is already the newest version (6.0-21+deb9u2).
curl is already the newest version (7.52.1-5+deb9u12).
software-properties-common is already the newest version (0.96.20.2-1+deb9u1).
0 upgraded, 0 newly installed, 0 to remove and 22 not upgraded.
2020-10-17 20:57:41 [INFO]: Starting network-manager service...
2020-10-17 20:57:41 [INFO]: Successfully installed required packages: apt-transport-https ca-certificates curl software-properties-common network-manager unzip lsb-release dirmngr jq !
2020-10-17 20:57:41 [INFO]: Disabling swap ...
2020-10-17 20:57:41 [INFO]: Start installing docker and nvidia-docker2 ...
2020-10-17 20:57:41 [INFO]: 'proteeti_prova' is already added to docker group. Skipping docker group configuration ...
2020-10-17 20:57:41 [INFO]: Skipping nvidia-docker install since it is already present.
WARNING: No swap limit support
2020-10-17 20:57:42 [INFO]: Docker Compose version 1.25.4 is already installed. Skipping docker-compose installation...
2020-10-17 20:57:42 [INFO]: The following versions of k8s components are already installed.
Error from server (NotFound): the server could not find the requested resource
2020-10-17 20:57:43 [INFO]: - kubectl: Client Version: v1.15.4
2020-10-17 20:57:43 [INFO]: - kubelet: Kubernetes v1.15.4
2020-10-17 20:57:44 [INFO]: - kubeadm: v1.15.4
2020-10-17 20:57:45 [INFO]: Skipping Kubernetes installation (version: 1.15.4-00) since Kubernetes is already present.
error: the server doesn't have a resource type "pods"

Upvotes: 1

Views: 6594

Answers (1)

Vit

Reputation: 8411

1. Instance:

GCP, Ubuntu 18.04
n1-standard-16 (16 vCPUs, 60 GB memory)
1 x NVIDIA Tesla T4

2. Downloading bootstrap, unpacking:

$curl -LO https://api.ngc.nvidia.com/v2/resources/nvidia/clara/clara_bootstrap/versions/0.7.1-2008.1/files/bootstrap.zip
$unzip bootstrap.zip -d bootstrap

3. Installing cuda as a prerequisite and reboot:

$wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
$sudo mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600
$wget https://developer.download.nvidia.com/compute/cuda/11.1.0/local_installers/cuda-repo-ubuntu1804-11-1-local_11.1.0-455.23.05-1_amd64.deb
$sudo dpkg -i cuda-repo-ubuntu1804-11-1-local_11.1.0-455.23.05-1_amd64.deb
$sudo apt-key add /var/cuda-repo-ubuntu1804-11-1-local/7fa2af80.pub
$sudo apt-get update
$sudo apt-get -y install cuda
$sudo reboot

4. Enable IP Forwarding after reboot:

$sudo -s
#echo 1 > /proc/sys/net/ipv4/ip_forward
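Keep in mind that the echo above only lasts until the next reboot. Here is a minimal sketch for checking the setting and, if you want, persisting it. The helper name ip_forward_enabled and the file name 99-ip-forward.conf are my own choices, not something bootstrap.sh uses.

```shell
#!/bin/sh
# Succeeds if the given file (default: the live kernel setting) contains "1".
# The optional path argument is an illustrative addition so the check can be
# pointed at any file.
ip_forward_enabled() {
    [ "$(cat "${1:-/proc/sys/net/ipv4/ip_forward}" 2>/dev/null)" = "1" ]
}

# To keep the setting across reboots (run by hand, needs root; the file name
# is an assumption, any name under /etc/sysctl.d/ works):
#   echo 'net.ipv4.ip_forward = 1' | sudo tee /etc/sysctl.d/99-ip-forward.conf
#   sudo sysctl --system
```

Calling ip_forward_enabled with no argument reads the live value, so you can use it right before starting the script: ip_forward_enabled || echo "IP forwarding is off".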

5. Running bootstrap.sh (1st time).

kubelet.service shows code=exited, status=255 error:

$sudo ./bootstrap/bootstrap.sh
...
...
● kubelet.service - kubelet: The Kubernetes Node Agent
       Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
      Drop-In: /etc/systemd/system/kubelet.service.d
               └─10-kubeadm.conf
       Active: activating (auto-restart) (Result: exit-code) since Mon 2020-10-19 10:40:54 UTC; 2s ago
         Docs: https://kubernetes.io/docs/home/
      Process: 2356 ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS (code=exited, status=255)
     Main PID: 2356 (code=exited, status=255)

This error means you should run kubeadm init manually. Run kubeadm init --pod-network-cidr=10.244.0.0/16, then check sudo service kubelet status again to make sure kubelet is running as expected. kubeadm init generates all the Kubernetes configs for you.

6. We add --pod-network-cidr=10.244.0.0/16 because we will use the Flannel CNI. You can see the same flag in bootstrap.sh, line 334: if ! sudo kubeadm init --pod-network-cidr="10.244.0.0/16"; then

$ sudo kubeadm init --pod-network-cidr=10.244.0.0/16
[init] Using Kubernetes version: v1.15.12
[preflight] Pulling images required for setting up a Kubernetes cluster
...
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Activating the kubelet service
...
[kubeconfig] Using kubeconfig folder "/etc/kubernetes"
[kubeconfig] Writing "admin.conf" kubeconfig file
[kubeconfig] Writing "kubelet.conf" kubeconfig file
[kubeconfig] Writing "controller-manager.conf" kubeconfig file
[kubeconfig] Writing "scheduler.conf" kubeconfig file
...
[apiclient] All control plane components are healthy after 19.501975 seconds
...
Your Kubernetes control-plane has initialized successfully!.
...
$ sudo service kubelet status
    ● kubelet.service - kubelet: The Kubernetes Node Agent
       Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
      Drop-In: /etc/systemd/system/kubelet.service.d
               └─10-kubeadm.conf
       Active: active (running) since Mon 2020-10-19 13:42:22 UTC; 4min 15s ago
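If you want to double-check that kubeadm init really wrote its configs, something like the following could help. It is only a sketch: the file list comes from the [kubeconfig] lines in the output above, and missing_kubeconfigs is a helper name I made up.

```shell
#!/bin/sh
# Prints the name of each kubeconfig file that kubeadm init is expected to
# write but that is missing (or empty) in the given directory.
# Default directory: /etc/kubernetes, as shown in the kubeadm output.
missing_kubeconfigs() {
    dir="${1:-/etc/kubernetes}"
    for f in admin.conf kubelet.conf controller-manager.conf scheduler.conf; do
        [ -s "$dir/$f" ] || echo "$f"
    done
}
```

An empty result from missing_kubeconfigs means all four files are in place; any name it prints is a file you would expect kubeadm init to have created.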

7. Next is the usual step that lets you run kubectl commands as your own user instead of root:

$mkdir -p $HOME/.kube
$sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
$sudo chown $(id -u):$(id -g) $HOME/.kube/config
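A quick way to confirm that the copy and chown worked is to check that the kubeconfig exists, is not empty, and belongs to you. This is a rough sketch: kubeconfig_usable is a hypothetical helper, and it assumes KUBECONFIG (if set) holds a single path rather than a colon-separated list.

```shell
#!/bin/sh
# Succeeds if the kubeconfig file is non-empty, readable and owned by the
# current user. With no argument it falls back to $KUBECONFIG, then to
# $HOME/.kube/config.
# Assumption: $KUBECONFIG, if set, is a single path.
kubeconfig_usable() {
    cfg="${1:-${KUBECONFIG:-$HOME/.kube/config}}"
    [ -s "$cfg" ] && [ -r "$cfg" ] && [ -O "$cfg" ]
}
```

If kubeconfig_usable fails for your user, kubectl has no usable config in the default location, which is worth ruling out before looking at anything else.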

8. Show everything currently installed

$ kubectl get all -A
NAMESPACE     NAME                                READY   STATUS    RESTARTS   AGE
kube-system   pod/coredns-5c98db65d4-cpz4s        0/1     Pending   0          4m17s
kube-system   pod/coredns-5c98db65d4-kgzg8        0/1     Pending   0          4m17s
kube-system   pod/etcd-clara                      1/1     Running   0          3m10s
kube-system   pod/kube-apiserver-clara            1/1     Running   0          3m35s
kube-system   pod/kube-controller-manager-clara   1/1     Running   0          3m17s
kube-system   pod/kube-proxy-8qx4z                1/1     Running   0          4m18s
kube-system   pod/kube-scheduler-clara            1/1     Running   0          3m23s
    
    
NAMESPACE     NAME                 TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)                  AGE
default       service/kubernetes   ClusterIP   10.96.0.1    <none>        443/TCP                  4m35s
kube-system   service/kube-dns     ClusterIP   10.96.0.10   <none>        53/UDP,53/TCP,9153/TCP   4m34s
    
NAMESPACE     NAME                        DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                 AGE
kube-system   daemonset.apps/kube-proxy   1         1         1       1            1           beta.kubernetes.io/os=linux   4m33s
    
NAMESPACE     NAME                      READY   UP-TO-DATE   AVAILABLE   AGE
kube-system   deployment.apps/coredns   0/2     2            0           4m34s
    
NAMESPACE     NAME                                 DESIRED   CURRENT   READY   AGE
kube-system   replicaset.apps/coredns-5c98db65d4   2         2         0       4m18s

Note that the coredns pods are currently in the Pending state. You can also see that the coredns deployment and replicaset are not ready:

NAMESPACE     NAME                      READY   UP-TO-DATE   AVAILABLE   AGE
kube-system   deployment.apps/coredns   0/2     2            0           4m34s
    
NAMESPACE     NAME                                 DESIRED   CURRENT   READY   AGE
kube-system   replicaset.apps/coredns-5c98db65d4   2         2         0       4m18s

They are waiting until you apply the Flannel configuration YAML. These are the lines from the same script:

info "Deploy kubernetes pod network."
sudo kubectl apply -f $SCRIPT_DIR/kube-flannel.yml
sudo kubectl apply -f $SCRIPT_DIR/kube-flannel-rbac.yml

If you skip this and rerun the script now, it will time out with this message:

2020-10-19 14:14:03 [INFO]: coredns pods are not running yet ...
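To see for yourself which kube-system pods are holding things up, you can filter the kubectl output. A small sketch, assuming the column layout shown in step 8 (STATUS is the third field when you pass --no-headers and a single namespace); not_running_pods is just a name I picked.

```shell
#!/bin/sh
# Reads `kubectl get pods -n kube-system --no-headers` output on stdin and
# prints the name of every pod whose STATUS column (3rd field) is not Running.
# Note: pods in states such as Completed are listed too.
not_running_pods() {
    awk 'NF >= 3 && $3 != "Running" { print $1 }'
}

# Example (needs a working cluster):
#   kubectl get pods -n kube-system --no-headers | not_running_pods
```

At this point in the walkthrough you would expect it to print the two coredns pods and nothing else.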

9. Deploy Flannel

$ kubectl apply -f bootstrap/kube-flannel.yml
podsecuritypolicy.extensions/psp.flannel.unprivileged created
clusterrole.rbac.authorization.k8s.io/flannel created
clusterrolebinding.rbac.authorization.k8s.io/flannel created
serviceaccount/flannel created
configmap/kube-flannel-cfg created
daemonset.extensions/kube-flannel-ds-amd64 created
daemonset.extensions/kube-flannel-ds-arm64 created
daemonset.extensions/kube-flannel-ds-arm created
daemonset.extensions/kube-flannel-ds-ppc64le created
daemonset.extensions/kube-flannel-ds-s390x created
    
$ kubectl apply -f bootstrap/kube-flannel-rbac.yml
clusterrole.rbac.authorization.k8s.io/flannel configured
clusterrolebinding.rbac.authorization.k8s.io/flannel unchanged

Immediately after that, everything related to coredns starts to work: the pods are created and reach the Running state, and the deployment and replicaset become ready.

NAMESPACE     NAME                                READY   STATUS    RESTARTS   AGE
kube-system   pod/coredns-5c98db65d4-cpz4s        1/1     Running   0          21m
kube-system   pod/coredns-5c98db65d4-kgzg8        1/1     Running   0          21m
    
NAMESPACE     NAME                      READY   UP-TO-DATE   AVAILABLE   AGE
kube-system   deployment.apps/coredns   2/2     2            2           21m
    
NAMESPACE     NAME                                 DESIRED   CURRENT   READY   AGE
kube-system   replicaset.apps/coredns-5c98db65d4   2         2         2       21m

In addition, you will see the new Flannel pod and daemonsets:

kube-system   pod/kube-flannel-ds-amd64-64jbv     1/1     Running   0          3m59s
    
    
NAMESPACE     NAME                                     DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                     AGE
kube-system   daemonset.apps/kube-flannel-ds-amd64     1         1         1       1            1           beta.kubernetes.io/arch=amd64     3m59s
kube-system   daemonset.apps/kube-flannel-ds-arm       0         0         0       0            0           beta.kubernetes.io/arch=arm       3m59s
kube-system   daemonset.apps/kube-flannel-ds-arm64     0         0         0       0            0           beta.kubernetes.io/arch=arm64     3m59s
kube-system   daemonset.apps/kube-flannel-ds-ppc64le   0         0         0       0            0           beta.kubernetes.io/arch=ppc64le   3m59s
kube-system   daemonset.apps/kube-flannel-ds-s390x     0         0         0       0            0           beta.kubernetes.io/arch=s390x     3m59s
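If coredns stays Pending even after Flannel is applied, one thing you could check is that the network in kube-flannel.yml matches the --pod-network-cidr from step 6. A rough sketch, assuming the bundled manifest has the usual net-conf.json section with a "Network" key (I have not verified this for every bootstrap version); flannel_network is a helper name I made up.

```shell
#!/bin/sh
# Prints the first "Network" value found in a Flannel manifest.
# Assumption: the manifest contains a line such as
#   "Network": "10.244.0.0/16",
# inside its net-conf.json block.
flannel_network() {
    sed -n 's/.*"Network"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$1" | head -n 1
}

# Example:
#   [ "$(flannel_network bootstrap/kube-flannel.yml)" = "10.244.0.0/16" ] \
#       || echo "Flannel network differs from --pod-network-cidr"
```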

10. Finally, it's time to continue running the script. It will TRY to install helm and tiller and restart dockerd. Everything works except TILLER...

$sudo ./bootstrap/bootstrap.sh
[INFO]: Clara Deploy SDK System Prerequisites Installation
...
Skipping Kubernetes installation (version: 1.15.4-00) since Kubernetes is already present.
./bootstrap/bootstrap.sh: line 412: helm: command not found
...
[INFO]: Start installing helm ...
...
[INFO]: Restarting dockerd...
The connection to the server *.*.*.*:6443 was refused - did you specify the right host or port?
[INFO]: Waiting for Kubernetes to be ready...
Kubernetes master is running at https://*.*.*.*:6443
KubeDNS is running at https://*.*.*.*:6443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
...
[INFO]: Updating permissions...
[INFO]: tiller pod is not started yet ...
[INFO]: tiller pod is not started yet ...
[INFO]: tiller pod is not started yet ...

11. We have NO Tiller pod, and as a result the deployment and replicaset are broken as well:

kube-system   deployment.apps/tiller-deploy   0/1  0 0 7m26s
kube-system   replicaset.apps/tiller-deploy-659c6788f5   1 0 0 7m26s

I don't see any solution here other than manually deleting Tiller's components (deployment, service) and reinstalling from scratch, with small workarounds.

#delete tiller (deployment and service)
$kubectl delete deployment tiller-deploy -n kube-system
$kubectl delete service tiller-deploy -n kube-system
    
#install helm,tiller
$curl https://raw.githubusercontent.com/helm/helm/master/scripts/get | bash
$kubectl create serviceaccount --namespace kube-system tiller
$kubectl create clusterrolebinding tiller-cluster-rule --clusterrole=cluster-admin --serviceaccount=kube-system:tiller
$helm init --service-account tiller

Now if you check what has been deployed, you will see that the tiller pod is in the Pending state and the tiller-deploy deployment is not ready:

NAMESPACE     NAME                                 READY   STATUS    RESTARTS   AGE
kube-system   pod/tiller-deploy-67847cd9b9-vlzm6   0/1     Pending   0          11m
    
NAMESPACE     NAME                            READY   UP-TO-DATE   AVAILABLE   AGE
kube-system   deployment.apps/tiller-deploy   0/1     1            0           11m
    
NAMESPACE     NAME                                       DESIRED   CURRENT   READY   AGE
kube-system   replicaset.apps/tiller-deploy-67847cd9b9   1         1         0       11m

12. Fixing tiller

Let's describe the tiller pod and look at its tolerations:

$ kubectl describe pod tiller-deploy-67847cd9b9-vlzm6 -n kube-system
    Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                     node.kubernetes.io/unreachable:NoExecute for 300s

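I won't go into detail (you can read about taints and tolerations on your own), but in short: the pod has no toleration for the node-role.kubernetes.io/master:NoSchedule taint that kubeadm puts on the master, and on a single-node cluster there is no other node to schedule it on. The fix is to allow the master to run pods: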

$kubectl taint nodes --all node-role.kubernetes.io/master-

After that you will see

NAMESPACE     NAME                                 READY   STATUS    RESTARTS   AGE
kube-system   pod/tiller-deploy-67847cd9b9-vlzm6   1/1     Running   0          13m
    
NAMESPACE     NAME                            READY   UP-TO-DATE   AVAILABLE   AGE
kube-system   deployment.apps/tiller-deploy   1/1     1            1           13m
    
NAMESPACE     NAME                                       DESIRED   CURRENT   READY   AGE
kube-system   replicaset.apps/tiller-deploy-67847cd9b9   1         1         1       13m
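If you want to confirm whether the master taint is present before or after the kubectl taint command from this step, you can grep the node description. A minimal sketch; has_master_taint is a hypothetical helper, and it assumes the taint key used by this Kubernetes version (node-role.kubernetes.io/master).

```shell
#!/bin/sh
# Reads `kubectl describe nodes` output on stdin and succeeds if any node
# still carries the node-role.kubernetes.io/master:NoSchedule taint.
has_master_taint() {
    grep -qF 'node-role.kubernetes.io/master:NoSchedule'
}

# Example (needs a working cluster):
#   kubectl describe nodes | has_master_taint && echo "master is still tainted"
```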

13. Next, installing all components:

$curl -LO https://api.ngc.nvidia.com/v2/resources/nvidia/clara/clara_cli/versions/0.7.1-2008.1/files/cli.zip
$sudo unzip cli.zip -d /usr/bin/ && sudo chmod 755 /usr/bin/clara*
    
$ clara version
Clara CLI version: 0.7.1-12788.ae65aea0
$ clara config --key KEY --orgteam nvidia/clara -y
Configuration "ngc-clara"successfully created
    
$ clara pull platform
Clara Platform 0.7.1-2008.1
Chart saved at: /home/YOUR_USER/.clara/charts/clara
    
$ clara platform start
Starting clara...
NAME:   clara
    
$ clara pull dicom
Clara Dicom Adapter 0.7.1-2008.1
Chart saved at: /home/YOUR_USER/.clara/charts/dicom-adapter
    
$ clara pull render
Clara Renderer 0.7.1-2008.1
Chart saved at: /home/YOUR_USER/.clara/charts/clara-renderer
    
$ clara pull monitor
Clara Monitor Server 0.7.1-2008.1
Chart saved at: /home/YOUR_USER/.clara/charts/clara-monitor-server
    
$ clara pull console
Clara Management Console 0.7.1-2008.1
Chart saved at: /home/YOUR_USER/.clara/charts/clara-console
    
$ clara dicom start
Starting DICOM Adapter...
NAME: clara-dicom-adapter
$ clara render start
NAME: clara-render-server
$ clara monitor start
NAME: clara-monitor-server
$ clara console start
NAME: clara-console

14. To verify that the installation is successful, run the following command:

$ helm ls
NAME                    REVISION        UPDATED                         STATUS          CHART                                   APP VERSION     NAMESPACE
clara                   1               Mon Oct 19 16:16:36 2020        DEPLOYED        clara-0.7.1-2008.1                      1.0             default  
clara-console           1               Mon Oct 19 16:28:30 2020        DEPLOYED        clara-console-0.7.1-2008.1              1.0             default  
clara-dicom-adapter     1               Mon Oct 19 16:22:36 2020        DEPLOYED        dicom-adapter-0.7.1-2008.1              1.0             default  
clara-monitor-server    1               Mon Oct 19 16:26:35 2020        DEPLOYED        clara-monitor-server-0.7.1-2008.1       1.0             default  
clara-render-server     1               Mon Oct 19 16:22:54 2020        DEPLOYED        clara-renderer-0.7.1-2008.1             1.0             default  
    
    
$ kubectl get pods
NAME                                                   READY   STATUS    RESTARTS   AGE
clara-clara-platformapiserver-54c5c44bbd-gqdd6         1/1     Running   0          13m
clara-console-8565b4d565-wcbg5                         2/2     Running   0          2m2s
clara-console-mongodb-85f8bd5f95-ts2gp                 1/1     Running   0          2m2s
clara-dicom-adapter-7948fcd445-mnsjd                   1/1     Running   0          7m56s
clara-monitor-server-fluentd-elasticsearch-6zvhq       1/1     Running   0          3m57s
clara-monitor-server-grafana-5f874b974d-6l4s8          1/1     Running   0          3m57s
clara-monitor-server-monitor-server-59c8bf68f7-5dgxq   1/1     Running   0          3m57s
clara-render-server-clara-renderer-d79dd4779-wcjrv     3/3     Running   0          7m38s
clara-resultsservice-664477898f-9nk4f                  1/1     Running   0          13m
clara-ui-6f89b97df8-792f6                              1/1     Running   0          13m
clara-workflow-controller-69cbb55fc8-zjhdm             1/1     Running   0          13m
elasticsearch-master-0                                 1/1     Running   0          3m57s
elasticsearch-master-1                                 1/1     Running   0          3m57s
fluentd-km8nj                                          1/1     Running   0          13m
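Instead of eyeballing the list, you could let a small filter tell you whether everything is up. This is only a sketch, assuming the same column layout as above (READY is the second field in the form ready/total); unready_pods is a name I chose.

```shell
#!/bin/sh
# Reads `kubectl get pods --no-headers` output on stdin.
# Prints every pod whose READY count is incomplete or whose STATUS is not
# Running, and returns a non-zero exit status if it printed anything.
unready_pods() {
    awk 'NF >= 3 {
             split($2, r, "/")
             if (r[1] != r[2] || $3 != "Running") { print $1; bad = 1 }
         }
         END { exit bad + 0 }'
}

# Example (needs a working cluster):
#   kubectl get pods --no-headers | unready_pods && echo "all pods ready"
```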

P.S. Sure, it would have been much easier to just fix the script for you, but I decided to show you what's going on in the background. I'm sure you can do it on your own if needed.

Upvotes: 2
