Wallace

Reputation: 651

Kubernetes Pod Stuck in Pending Without Indicating Any Reason

We are using client-go to create Kubernetes Jobs and Deployments. Today, in one of our clusters (Kubernetes v1.18.19), I encountered the weird problem below.

Pods of a Kubernetes Job are always stuck in Pending status, without any stated reason. kubectl describe pod shows no events at all. Creating Jobs from the host (via kubectl) works normally, and those pods eventually become Running.

What surprises me is that creating Deployments works fine: their pods eventually run! It fails only for Kubernetes Jobs. Why? How can I fix this? What can I do? I have spent hours on this with no progress.

kubeconfig used by client-go:

Mounted from the host machine, path: /root/.kube/config

kubectl describe job shows:

Name:           unittest
Namespace:      default
Selector:       controller-uid=f3cec901-c0f4-4098-86d7-f9a7d1fe6cd1
Labels:         job-id=unittest
Annotations:    <none>
Parallelism:    1
Completions:    1
Start Time:     Sat, 19 Jun 2021 00:20:12 +0800
Pods Statuses:  1 Running / 0 Succeeded / 0 Failed
Pod Template:
  Labels:  controller-uid=f3cec901-c0f4-4098-86d7-f9a7d1fe6cd1
           job-name=unittest
  Containers:
   unittest:
    Image:      ubuntu:18.04
    Port:       <none>
    Host Port:  <none>
    Command:
      echo hello
    Environment:  <none>
    Mounts:       <none>
  Volumes:        <none>
Events:
  Type    Reason            Age   From            Message
  ----    ------            ----  ----            -------
  Normal  SuccessfulCreate  21m   job-controller  Created pod: unittest-tt5b2

kubectl describe on the target pod shows:

Name:           unittest-tt5b2
Namespace:      default
Priority:       0
Node:           <none>
Labels:         controller-uid=f3cec901-c0f4-4098-86d7-f9a7d1fe6cd1
               job-name=unittest
Annotations:    <none>
Status:         Pending
IP:             
IPs:            <none>
Controlled By:  Job/unittest
Containers:
 unittest:
   Image:      ubuntu:18.04
   Port:       <none>
   Host Port:  <none>
   Command:
     echo hello
   Environment:  <none>
   Mounts:
     /var/run/secrets/kubernetes.io/serviceaccount from default-token-72g27 (ro)
Volumes:
 default-token-72g27:
   Type:        Secret (a volume populated by a Secret)
   SecretName:  default-token-72g27
   Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                node.kubernetes.io/unreachable:NoExecute for 300s
Events:          <none> 

kubectl get events shows:

55m         Normal    ScalingReplicaSet   deployment/job-scheduler              Scaled up replica set job-scheduler-76b7465d74 to 1
19m         Normal    ScalingReplicaSet   deployment/job-scheduler              Scaled up replica set job-scheduler-74f8896f48 to 1
58m         Normal    SuccessfulCreate    job/unittest                          Created pod: unittest-pp665
49m         Normal    SuccessfulCreate    job/unittest                          Created pod: unittest-xm6ck
17m         Normal    SuccessfulCreate    job/unittest                          Created pod: unittest-tt5b2

Upvotes: 1

Views: 2243

Answers (1)

Wallace

Reputation: 651

I fixed the issue.

We use a custom scheduler for NPU devices and the default scheduler for GPU devices. For GPU devices, the scheduler name is "default-scheduler", not "default". I was passing "default" as the scheduler name for those Kubernetes Jobs, which left the pods stuck in Pending: no scheduler was registered under that name, so nothing ever picked the pods up, and they produced no events.

Upvotes: 3
