Reputation: 11023
I am using manual scaling in an Azure AKS cluster, which can scale up to 60 nodes.
The scaling command worked fine:
az aks scale --resource-group RG-1 --name KS-3 --node-count 46
- Finished ..
{
  "agentPoolProfiles": [
    {
      "availabilityZones": null,
      "count": 46,
      ...
and the output reported a node count of 46.
The status also shows "Succeeded":
az aks show --name KS-3 --resource-group RG-1 -o table
Name Location ResourceGroup KubernetesVersion ProvisioningState Fqdn
---------- ---------- ------------------ ------------------- ------------------- -------------------------
KS-3 xxxxxxx RG-1 1.16.13 Succeeded xxx.azmk8s.io
However, when I look at kubectl get nodes, it shows only 44 nodes:
kubectl get nodes | grep -c 'aks'
44
with 7 nodes in "Ready,SchedulingDisabled" state (and the rest in Ready state):
kubectl get nodes | grep -c "Ready,SchedulingDisabled"
7
When I try to scale the cluster down to 45 nodes, it gives this error:
Deployment failed. Correlation ID: xxxx-xxx-xxx-x. Node 'aks-nodepool1-xxxx-vmss000yz8' failed to be drained with error: 'nodes "aks-nodepool1-xxxx-vmss000yz8" not found'
I am not sure what put the cluster into this inconsistent state or how to go about debugging it.
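One way I could imagine checking for drift is to compare the node names Kubernetes reports against the instances in the VM scale set backing the node pool. This is only a sketch; the node resource group and scale set names below are placeholders that have to be looked up for your cluster:

# Node names Kubernetes knows about
kubectl get nodes -o name

# Find the node resource group and the scale set backing the node pool
az aks show --resource-group RG-1 --name KS-3 --query nodeResourceGroup -o tsv
az vmss list --resource-group <node-resource-group> -o table

# Computer names of the scale set instances, to compare with kubectl's list
az vmss list-instances --resource-group <node-resource-group> --name <vmss-name> --query "[].osProfile.computerName" -o tsv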
Upvotes: 1
Views: 737
Reputation: 11023
It happened because two of the nodes were in a corrupted state. We had to delete these nodes from the VM scale set in the node resource group associated with our AKS cluster. I am still not sure why the nodes got into this state, though.
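For reference, a rough sketch of how such instances can be removed with the CLI (the node resource group name, scale set name, and instance IDs below are placeholders and must be looked up first):

# Map scale set instance IDs to node computer names to identify the corrupted instances
az vmss list-instances --resource-group <node-resource-group> --name aks-nodepool1-xxxx-vmss --query "[].{id:instanceId, name:osProfile.computerName}" -o table

# Delete the corrupted instances from the scale set
az vmss delete-instances --resource-group <node-resource-group> --name aks-nodepool1-xxxx-vmss --instance-ids <id1> <id2>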
Upvotes: 4