Reputation: 30925
I'm deploying HA Vault on Kubernetes (EKS) and getting this error on one of the Vault pods, which I think is causing the other pods to fail as well:
This is the output of kubectl get events (search for "nodes are available: 1 Insufficient memory"):
26m Normal Created pod/vault-1 Created container vault
26m Normal Started pod/vault-1 Started container vault
26m Normal Pulled pod/vault-1 Container image "hashicorp/vault-enterprise:1.5.0_ent" already present on machine
7m40s Warning BackOff pod/vault-1 Back-off restarting failed container
2m38s Normal Scheduled pod/vault-1 Successfully assigned vault-foo/vault-1 to ip-10-101-0-103.ec2.internal
2m35s Normal SuccessfulAttachVolume pod/vault-1 AttachVolume.Attach succeeded for volume "pvc-acfc7e26-3616-4075-ab79-0c3f7b0f6470"
2m35s Normal SuccessfulAttachVolume pod/vault-1 AttachVolume.Attach succeeded for volume "pvc-19d03d48-1de2-41f8-aadf-02d0a9f4bfbd"
48s Normal Pulled pod/vault-1 Container image "hashicorp/vault-enterprise:1.5.0_ent" already present on machine
48s Normal Created pod/vault-1 Created container vault
99s Normal Started pod/vault-1 Started container vault
60s Warning BackOff pod/vault-1 Back-off restarting failed container
27m Normal TaintManagerEviction pod/vault-2 Cancelling deletion of Pod vault-foo/vault-2
28m Warning FailedScheduling pod/vault-2 0/4 nodes are available: 1 Insufficient memory, 4 Insufficient cpu.
28m Warning FailedScheduling pod/vault-2 0/5 nodes are available: 1 Insufficient memory, 1 node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn't tolerate, 4 Insufficient cpu.
27m Normal Scheduled pod/vault-2 Successfully assigned vault-foo/vault-2 to ip-10-101-0-103.ec2.internal
27m Normal SuccessfulAttachVolume pod/vault-2 AttachVolume.Attach succeeded for volume "pvc-fb91141d-ebd9-4767-b122-da8c98349cba"
27m Normal SuccessfulAttachVolume pod/vault-2 AttachVolume.Attach succeeded for volume "pvc-95effe76-6e01-49ad-9bec-14e091e1a334"
27m Normal Pulling pod/vault-2 Pulling image "hashicorp/vault-enterprise:1.5.0_ent"
27m Normal Pulled pod/vault-2 Successfully pulled image "hashicorp/vault-enterprise:1.5.0_ent"
26m Normal Created pod/vault-2 Created container vault
26m Normal Started pod/vault-2 Started container vault
26m Normal Pulled pod/vault-2 Container image "hashicorp/vault-enterprise:1.5.0_ent" already present on machine
7m26s Warning BackOff pod/vault-2 Back-off restarting failed container
2m36s Warning FailedScheduling pod/vault-2 0/7 nodes are available: 1 Insufficient memory, 1 node(s) didn't match pod affinity/anti-affinity, 1 node(s) didn't satisfy existing pods anti-affinity rules, 1 node(s) had volume node affinity conflict, 1 node(s) were unschedulable, 4 Insufficient cpu.
114s Warning FailedScheduling pod/vault-2 0/8 nodes are available: 1 Insufficient memory, 1 node(s) didn't match pod affinity/anti-affinity, 1 node(s) didn't satisfy existing pods anti-affinity rules, 1 node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn't tolerate, 1 node(s) had volume node affinity conflict, 1 node(s) were unschedulable, 4 Insufficient cpu.
104s Warning FailedScheduling pod/vault-2 0/9 nodes are available: 1 Insufficient memory, 1 node(s) didn't match pod affinity/anti-affinity, 1 node(s) didn't satisfy existing pods anti-affinity rules, 1 node(s) had volume node affinity conflict, 1 node(s) were unschedulable, 2 node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn't tolerate, 4 Insufficient cpu.
93s Normal Scheduled pod/vault-2 Successfully assigned vault-foo/vault-2 to ip-10-101-0-82.ec2.internal
88s Normal SuccessfulAttachVolume pod/vault-2 AttachVolume.Attach succeeded for volume "pvc-fb91141d-ebd9-4767-b122-da8c98349cba"
88s Normal SuccessfulAttachVolume pod/vault-2 AttachVolume.Attach succeeded for volume "pvc-95effe76-6e01-49ad-9bec-14e091e1a334"
83s Normal Pulling pod/vault-2 Pulling image "hashicorp/vault-enterprise:1.5.0_ent"
81s Normal Pulled pod/vault-2 Successfully pulled image "hashicorp/vault-enterprise:1.5.0_ent"
38s Normal Created pod/vault-2 Created container vault
37s Normal Started pod/vault-2 Started container vault
38s Normal Pulled pod/vault-2 Container image "hashicorp/vault-enterprise:1.5.0_ent" already present on machine
4s Warning BackOff pod/vault-2 Back-off restarting failed container
2m38s Normal Scheduled pod/vault-agent-injector-d54bdc675-qwsmz Successfully assigned vault-foo/vault-agent-injector-d54bdc675-qwsmz to ip-10-101-2-91.ec2.internal
2m37s Normal Pulling pod/vault-agent-injector-d54bdc675-qwsmz Pulling image "hashicorp/vault-k8s:latest"
2m36s Normal Pulled pod/vault-agent-injector-d54bdc675-qwsmz Successfully pulled image "hashicorp/vault-k8s:latest"
2m36s Normal Created pod/vault-agent-injector-d54bdc675-qwsmz Created container sidecar-injector
2m35s Normal Started pod/vault-agent-injector-d54bdc675-qwsmz Started container sidecar-injector
28m Normal Scheduled pod/vault-agent-injector-d54bdc675-wz9ws Successfully assigned vault-foo/vault-agent-injector-d54bdc675-wz9ws to ip-10-101-0-87.ec2.internal
28m Normal Pulled pod/vault-agent-injector-d54bdc675-wz9ws Container image "hashicorp/vault-k8s:latest" already present on machine
28m Normal Created pod/vault-agent-injector-d54bdc675-wz9ws Created container sidecar-injector
28m Normal Started pod/vault-agent-injector-d54bdc675-wz9ws Started container sidecar-injector
3m22s Normal Killing pod/vault-agent-injector-d54bdc675-wz9ws Stopping container sidecar-injector
3m22s Warning Unhealthy pod/vault-agent-injector-d54bdc675-wz9ws Readiness probe failed: Get https://10.101.0.73:8080/health/ready: dial tcp 10.101.0.73:8080: connect: connection refused
3m18s Warning Unhealthy pod/vault-agent-injector-d54bdc675-wz9ws Liveness probe failed: Get https://10.101.0.73:8080/health/ready: dial tcp 10.101.0.73:8080: connect: no route to host
28m Normal SuccessfulCreate replicaset/vault-agent-injector-d54bdc675 Created pod: vault-agent-injector-d54bdc675-wz9ws
2m38s Normal SuccessfulCreate replicaset/vault-agent-injector-d54bdc675 Created pod: vault-agent-injector-d54bdc675-qwsmz
28m Normal ScalingReplicaSet deployment/vault-agent-injector Scaled up replica set vault-agent-injector-d54bdc675 to 1
2m38s Normal ScalingReplicaSet deployment/vault-agent-injector Scaled up replica set vault-agent-injector-d54bdc675 to 1
28m Normal EnsuringLoadBalancer service/vault-ui Ensuring load balancer
28m Normal EnsuredLoadBalancer service/vault-ui Ensured load balancer
26m Normal UpdatedLoadBalancer service/vault-ui Updated load balancer with new hosts
3m24s Normal DeletingLoadBalancer service/vault-ui Deleting load balancer
3m23s Warning PortNotAllocated service/vault-ui Port 32476 is not allocated; repairing
3m23s Warning ClusterIPNotAllocated service/vault-ui Cluster IP 172.20.216.143 is not allocated; repairing
3m22s Warning FailedToUpdateEndpointSlices service/vault-ui Error updating Endpoint Slices for Service vault-foo/vault-ui: failed to update vault-ui-crtg4 EndpointSlice for Service vault-foo/vault-ui: Operation cannot be fulfilled on endpointslices.discovery.k8s.io "vault-ui-crtg4": the object has been modified; please apply your changes to the latest version and try again
3m16s Warning FailedToUpdateEndpoint endpoints/vault-ui Failed to update endpoint vault-foo/vault-ui: Operation cannot be fulfilled on endpoints "vault-ui": the object has been modified; please apply your changes to the latest version and try again
2m52s Normal DeletedLoadBalancer service/vault-ui Deleted load balancer
2m39s Normal EnsuringLoadBalancer service/vault-ui Ensuring load balancer
2m36s Normal EnsuredLoadBalancer service/vault-ui Ensured load balancer
96s Normal UpdatedLoadBalancer service/vault-ui Updated load balancer with new hosts
28m Normal NoPods poddisruptionbudget/vault No matching pods found
28m Normal SuccessfulCreate statefulset/vault create Pod vault-0 in StatefulSet vault successful
28m Normal SuccessfulCreate statefulset/vault create Pod vault-1 in StatefulSet vault successful
28m Normal SuccessfulCreate statefulset/vault create Pod vault-2 in StatefulSet vault successful
2m40s Normal NoPods poddisruptionbudget/vault No matching pods found
2m38s Normal SuccessfulCreate statefulset/vault create Pod vault-0 in StatefulSet vault successful
2m38s Normal SuccessfulCreate statefulset/vault create Pod vault-1 in StatefulSet vault successful
2m38s Normal SuccessfulCreate statefulset/vault create Pod vault-2 in StatefulSet vault successful
And these are my Helm value overrides:
# Vault Helm Chart Value Overrides
global:
  enabled: true
  tlsDisable: false

injector:
  enabled: true
  # Use the Vault K8s Image https://github.com/hashicorp/vault-k8s/
  image:
    repository: "hashicorp/vault-k8s"
    tag: "latest"
  resources:
    requests:
      memory: 256Mi
      cpu: 250m
    limits:
      memory: 256Mi
      cpu: 250m

server:
  # Use the Enterprise Image
  image:
    repository: "hashicorp/vault-enterprise"
    tag: "1.5.0_ent"

  # These Resource Limits are in line with node requirements in the
  # Vault Reference Architecture for a Small Cluster
  resources:
    requests:
      memory: 8Gi
      cpu: 2000m
    limits:
      memory: 16Gi
      cpu: 2000m

  # For HA configuration and because we need to manually init the vault,
  # we need to define custom readiness/liveness Probe settings
  readinessProbe:
    enabled: true
    path: "/v1/sys/health?standbyok=true&sealedcode=204&uninitcode=204"
  livenessProbe:
    enabled: true
    path: "/v1/sys/health?standbyok=true"
    initialDelaySeconds: 60

  # extraEnvironmentVars is a list of extra environment variables to set with the stateful set. These could be
  # used to include variables required for auto-unseal.
  extraEnvironmentVars:
    VAULT_CACERT: /vault/userconfig/vault-server-tls/vault.ca

  # extraVolumes is a list of extra volumes to mount. These will be exposed
  # to Vault in the path .
  #extraVolumes:
  #  - type: secret
  #    name: tls-server
  #  - type: secret
  #    name: tls-ca
  #  - type: secret
  #    name: kms-creds
  extraVolumes:
    - type: secret
      name: vault-server-tls

  # This configures the Vault Statefulset to create a PVC for audit logs.
  # See https://www.vaultproject.io/docs/audit/index.html to know more
  auditStorage:
    enabled: true

  standalone:
    enabled: false

  # Run Vault in "HA" mode.
  ha:
    enabled: true
    replicas: 3
    raft:
      enabled: true
      setNodeId: true
      config: |
        ui = true
        listener "tcp" {
          address = "[::]:8200"
          cluster_address = "[::]:8201"
          tls_cert_file = "/vault/userconfig/vault-server-tls/vault.crt"
          tls_key_file = "/vault/userconfig/vault-server-tls/vault.key"
          tls_ca_cert_file = "/vault/userconfig/vault-server-tls/vault.ca"
        }
        storage "raft" {
          path = "/vault/data"
          retry_join {
            leader_api_addr = "http://vault-0.vault-internal:8200"
            leader_ca_cert_file = "/vault/userconfig/vault-server-tls/vault.ca"
            leader_client_cert_file = "/vault/userconfig/vault-server-tls/vault.crt"
            leader_client_key_file = "/vault/userconfig/vault-server-tls/vault.key"
          }
          retry_join {
            leader_api_addr = "http://vault-1.vault-internal:8200"
            leader_ca_cert_file = "/vault/userconfig/vault-server-tls/vault.ca"
            leader_client_cert_file = "/vault/userconfig/vault-server-tls/vault.crt"
            leader_client_key_file = "/vault/userconfig/vault-server-tls/vault.key"
          }
          retry_join {
            leader_api_addr = "http://vault-2.vault-internal:8200"
            leader_ca_cert_file = "/vault/userconfig/vault-server-tls/vault.ca"
            leader_client_cert_file = "/vault/userconfig/vault-server-tls/vault.crt"
            leader_client_key_file = "/vault/userconfig/vault-server-tls/vault.key"
          }
        }
        service_registration "kubernetes" {}

# Vault UI
ui:
  enabled: true
  serviceType: "LoadBalancer"
  serviceNodePort: null
  externalPort: 8200
  # For Added Security, edit the below
  #loadBalancerSourceRanges:
  #  - < Your IP RANGE Ex. 10.0.0.0/16 >
  #  - < YOUR SINGLE IP Ex. 1.78.23.3/32 >
What did I not configure correctly?
Upvotes: 2
Views: 9064
Reputation: 13898
There are several issues here, and they are all represented by error messages like:
0/9 nodes are available: 1 Insufficient memory, 1 node(s) didn't match pod affinity/anti-affinity, 1 node(s) didn't satisfy existing pods anti-affinity rules, 1 node(s) had volume node affinity conflict, 1 node(s) were unschedulable, 2 node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn't tolerate, 4 Insufficient cpu.
You have 9 nodes, but none of them is available for scheduling due to a set of different conditions. Note that a single node can be affected by more than one issue, so the counts can add up to more than your total number of nodes.
Let's break them down one by one:
Insufficient memory: Execute kubectl describe node <node-name> to check how much allocatable memory is left on each node, and compare it with the requests and limits of your pods. Note that Kubernetes reserves the full amount of memory a pod requests, regardless of how much the pod actually uses. (A hedged example of lowering the chart's resource requests is sketched after this list.)

Insufficient cpu: Same as above, but for CPU.

node(s) didn't match pod affinity/anti-affinity: Check your affinity/anti-affinity rules.

node(s) didn't satisfy existing pods anti-affinity rules: Same as above.

node(s) had volume node affinity conflict: This happens when a pod cannot be scheduled because its volume sits in a different Availability Zone than the candidate node. You can fix this by creating a StorageClass for a single zone and then using that StorageClass in your PVC (see the StorageClass sketch after this list).

node(s) were unschedulable: The node is marked as Unschedulable (for example, it was cordoned), which leads us to the next issue:

node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn't tolerate: This corresponds to the NodeCondition Ready = False. You can use kubectl describe node <node-name> to check the taints and kubectl taint nodes <node-name> <taint-name>- to remove them. Check Taints and Tolerations for more details.
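Regarding the Insufficient memory / Insufficient cpu items specifically: your overrides request 8Gi of memory and 2000m of CPU per Vault pod, and with three replicas plus the anti-affinity seen in your events you need three nodes that each have that much allocatable capacity free. As a minimal sketch, assuming your EKS nodes are smaller than the Vault Reference Architecture sizing, you could lower the requests in the value overrides (the numbers below are only illustrative, not a recommendation):

# values.yaml override -- illustrative numbers only; size them to your actual nodes
server:
  resources:
    requests:
      memory: 4Gi    # the scheduler reserves this full amount per Vault pod
      cpu: 1000m
    limits:
      memory: 8Gi
      cpu: 2000m

The alternative is to keep the original requests and move to larger instance types, so that each node's allocatable memory/CPU actually covers one Vault replica.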
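Regarding the volume node affinity conflict: EBS-backed PersistentVolumes are zonal, so a pod that reuses an existing PVC can only run on nodes in the volume's Availability Zone. A sketch of a single-zone StorageClass, assuming the zone us-east-1a and a made-up class name (replace both with your own values):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp2-single-az               # hypothetical name
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
# WaitForFirstConsumer delays volume creation until the pod is scheduled,
# which also avoids zone mismatches for newly created PVCs
volumeBindingMode: WaitForFirstConsumer
allowedTopologies:
  - matchLabelExpressions:
      - key: topology.kubernetes.io/zone   # on older clusters this label is failure-domain.beta.kubernetes.io/zone
        values:
          - us-east-1a                     # assumption: the zone your nodes/volumes live in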
Also, there is a GitHub thread with a similar issue that you may find useful.
Try checking/eliminating these issues one by one (starting from the first one listed above), as they can cause a "chain reaction" in some scenarios.
Upvotes: 7