GKE Autopilot Guardian is incompatible with cpu resource requests in chart

Question

I have a GKE private cluster in autopilot mode running gke1.23, described below. I am trying to install an application from a vendor's helm chart, following their instructions, I use a script like this:

#! /bin/bash
helm repo add safesoftware https://safesoftware.github.io/helm-charts/
helm repo update
tag="2021.2"
version="safesoftware/fmeserver-$tag"

helm upgrade --install \
    fmeserver   \
    $version  \
    --set fmeserver.image.tag=$tag \
    --set deployment.hostname="REDACTED" \
    --set deployment.useHostnameIngress=true \
    --set deployment.tlsSecretName="my-ssl-cert" \
    --namespace ingress-nginx --create-namespace \
    #--set resources.core.requests.cpu="500m" \
    #--set resources.queue.requests.cpu="500m" \

However, I get errors from the GKE Warden!

Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "ingress-nginx" chart repository
...Successfully got an update from the "safesoftware" chart repository
Update Complete. ⎈Happy Helming!⎈
W1201 10:25:08.117532   29886 warnings.go:70] Autopilot increased resource requests for Deployment ingress-nginx/engine-standard-group to meet requirements. See http://g.co/gke/autopilot-resources.
W1201 10:25:08.201656   29886 warnings.go:70] Autopilot increased resource requests for StatefulSet ingress-nginx/fmeserver-postgresql to meet requirements. See http://g.co/gke/autopilot-resources.
W1201 10:25:08.304755   29886 warnings.go:70] Autopilot increased resource requests for StatefulSet ingress-nginx/core to meet requirements. See http://g.co/gke/autopilot-resources.
W1201 10:25:08.392965   29886 warnings.go:70] Autopilot increased resource requests for StatefulSet ingress-nginx/queue to meet requirements. See http://g.co/gke/autopilot-resources.
W1201 10:25:08.480421   29886 warnings.go:70] Autopilot increased resource requests for StatefulSet ingress-nginx/websocket to meet requirements. See http://g.co/gke/autopilot-resources.

Error: UPGRADE FAILED: cannot patch "core" with kind StatefulSet: admission webhook "gkepolicy.common-webhooks.networking.gke.io" denied the request: GKE Warden rejected the request because it violates one or more policies: {"[denied by autogke-pod-limit-constraints]":["workload 'core' cpu requests '{{400 -3} {\u003cnil\u003e}  DecimalSI}' is lower than the Autopilot minimum required of '{{500 -3} {\u003cnil\u003e} 500m DecimalSI}' for using pod anti affinity. Requested by user: 'REDACTED', groups: 'system:authenticated'."]} && cannot patch "queue" with kind StatefulSet: admission webhook "gkepolicy.common-webhooks.networking.gke.io" denied the request: GKE Warden rejected the request because it violates one or more policies: {"[denied by autogke-pod-limit-constraints]":["workload 'queue' cpu requests '{{250 -3} {\u003cnil\u003e}  DecimalSI}' is lower than the Autopilot minimum required of '{{500 -3} {\u003cnil\u003e} 500m DecimalSI}' for using pod anti affinity. Requested by user: 'REDACTED', groups: 'system:authenticated'."]}

So I modified the cpu requests in the resource spec for the pods causing the issues, one way is to uncomment the last two lines of the script.

    --set resources.core.requests.cpu="500m" \
    --set resources.queue.requests.cpu="500m" \

This lets me install or upgrade the chart but then I get PodUnschedulable, Reason Cannot schedule pods: Insufficient cpu. Depending on the exact changes to the chart, I sometimes also see Cannot schedule pods: node(s) had volume node affinity conflict.

I can't see how to increase either the number of pods or size of each (e2-medium) node in autopilot mode. Nor can I find a way to remove those guards. I have checked the quotas and can't see any quota issue. I can install other workloads, including ingress-nginx.

I am not sure what the issue is and I am not a expert with helm or Kubernetes.

For reference, the cluster can be described as:

addonsConfig:
  cloudRunConfig:
    disabled: true
    loadBalancerType: LOAD_BALANCER_TYPE_EXTERNAL
  configConnectorConfig: {}
  dnsCacheConfig:
    enabled: true
  gcePersistentDiskCsiDriverConfig:
    enabled: true
  gcpFilestoreCsiDriverConfig:
    enabled: true
  gkeBackupAgentConfig: {}
  horizontalPodAutoscaling: {}
  httpLoadBalancing: {}
  kubernetesDashboard:
    disabled: true
  networkPolicyConfig:
    disabled: true
autopilot:
  enabled: true
autoscaling:
  autoprovisioningNodePoolDefaults:
    imageType: COS_CONTAINERD
    management:
      autoRepair: true
      autoUpgrade: true
    oauthScopes:
    - https://www.googleapis.com/auth/devstorage.read_only
    - https://www.googleapis.com/auth/logging.write
    - https://www.googleapis.com/auth/monitoring
    - https://www.googleapis.com/auth/service.management.readonly
    - https://www.googleapis.com/auth/servicecontrol
    - https://www.googleapis.com/auth/trace.append
    serviceAccount: default
    upgradeSettings:
      maxSurge: 1
      strategy: SURGE
  autoscalingProfile: OPTIMIZE_UTILIZATION
  enableNodeAutoprovisioning: true
  resourceLimits:
  - maximum: '1000000000'
    resourceType: cpu
  - maximum: '1000000000'
    resourceType: memory
  - maximum: '1000000000'
    resourceType: nvidia-tesla-t4
  - maximum: '1000000000'
    resourceType: nvidia-tesla-a100
binaryAuthorization: {}
clusterIpv4Cidr: 10.102.0.0/21
createTime: '2022-11-30T04:47:19+00:00'
currentMasterVersion: 1.23.12-gke.100
currentNodeCount: 7
currentNodeVersion: 1.23.12-gke.100
databaseEncryption:
  state: DECRYPTED
defaultMaxPodsConstraint:
  maxPodsPerNode: '110'
endpoint: REDACTED
id: REDACTED
initialClusterVersion: 1.23.12-gke.100
initialNodeCount: 1
instanceGroupUrls: REDACTED
ipAllocationPolicy:
  clusterIpv4Cidr: 10.102.0.0/21
  clusterIpv4CidrBlock: 10.102.0.0/21
  clusterSecondaryRangeName: pods
  servicesIpv4Cidr: 10.103.0.0/24
  servicesIpv4CidrBlock: 10.103.0.0/24
  servicesSecondaryRangeName: services
  stackType: IPV4
  useIpAliases: true
labelFingerprint: '05525394'
legacyAbac: {}
location: europe-west3
locations:
- europe-west3-c
- europe-west3-a
- europe-west3-b
loggingConfig:
  componentConfig:
    enableComponents:
    - SYSTEM_COMPONENTS
    - WORKLOADS
loggingService: logging.googleapis.com/kubernetes
maintenancePolicy:
  resourceVersion: 93731cbd
  window:
    dailyMaintenanceWindow:
      duration: PT4H0M0S
      startTime: 03:00
masterAuth:
masterAuthorizedNetworksConfig:
  cidrBlocks:
  enabled: true
monitoringConfig:
  componentConfig:
    enableComponents:
    - SYSTEM_COMPONENTS
monitoringService: monitoring.googleapis.com/kubernetes
name: gis-cluster-uat
network: geo-nw-uat
networkConfig:
nodeConfig:
  diskSizeGb: 100
  diskType: pd-standard
  imageType: COS_CONTAINERD
  machineType: e2-medium
  metadata:
    disable-legacy-endpoints: 'true'
  oauthScopes:
  - https://www.googleapis.com/auth/devstorage.read_only
  - https://www.googleapis.com/auth/logging.write
  - https://www.googleapis.com/auth/monitoring
  - https://www.googleapis.com/auth/service.management.readonly
  - https://www.googleapis.com/auth/servicecontrol
  - https://www.googleapis.com/auth/trace.append
  serviceAccount: default
  shieldedInstanceConfig:
    enableIntegrityMonitoring: true
    enableSecureBoot: true
  workloadMetadataConfig:
    mode: GKE_METADATA
nodePoolAutoConfig: {}
nodePoolDefaults:
  nodeConfigDefaults:
    loggingConfig:
      variantConfig:
        variant: DEFAULT
nodePools:
- autoscaling:
    autoprovisioned: true
    enabled: true
    maxNodeCount: 1000
  config:
    diskSizeGb: 100
    diskType: pd-standard
    imageType: COS_CONTAINERD
    machineType: e2-medium
    metadata:
      disable-legacy-endpoints: 'true'
    oauthScopes:
    - https://www.googleapis.com/auth/devstorage.read_only
    - https://www.googleapis.com/auth/logging.write
    - https://www.googleapis.com/auth/monitoring
    - https://www.googleapis.com/auth/service.management.readonly
    - https://www.googleapis.com/auth/servicecontrol
    - https://www.googleapis.com/auth/trace.append
    serviceAccount: default
    shieldedInstanceConfig:
      enableIntegrityMonitoring: true
      enableSecureBoot: true
    workloadMetadataConfig:
      mode: GKE_METADATA
  initialNodeCount: 1
  instanceGroupUrls:
  locations:
  management:
    autoRepair: true
    autoUpgrade: true
  maxPodsConstraint:
    maxPodsPerNode: '32'
  name: default-pool
  networkConfig:
    podIpv4CidrBlock: 10.102.0.0/21
    podRange: pods
  podIpv4CidrSize: 26
  selfLink: REDACTED
  status: RUNNING
  upgradeSettings:
    maxSurge: 1
    strategy: SURGE
  version: 1.23.12-gke.100
- autoscaling:
    autoprovisioned: true
    enabled: true
    maxNodeCount: 1000
  config:
    diskSizeGb: 100
    diskType: pd-standard
    imageType: COS_CONTAINERD
    machineType: e2-standard-2
    metadata:
      disable-legacy-endpoints: 'true'
    oauthScopes:
    - https://www.googleapis.com/auth/devstorage.read_only
    - https://www.googleapis.com/auth/logging.write
    - https://www.googleapis.com/auth/monitoring
    - https://www.googleapis.com/auth/service.management.readonly
    - https://www.googleapis.com/auth/servicecontrol
    - https://www.googleapis.com/auth/trace.append
    reservationAffinity:
      consumeReservationType: NO_RESERVATION
    serviceAccount: default
    shieldedInstanceConfig:
      enableIntegrityMonitoring: true
      enableSecureBoot: true
    workloadMetadataConfig:
      mode: GKE_METADATA
  instanceGroupUrls:
  locations:
  management:
    autoRepair: true
    autoUpgrade: true
  maxPodsConstraint:
    maxPodsPerNode: '32'
  name: nap-1rrw9gqf
  networkConfig:
    podIpv4CidrBlock: 10.102.0.0/21
    podRange: pods
  podIpv4CidrSize: 26
  selfLink: REDACTED
  status: RUNNING
  upgradeSettings:
    maxSurge: 1
    strategy: SURGE
  version: 1.23.12-gke.100
notificationConfig:
  pubsub: {}
privateClusterConfig:
  enablePrivateNodes: true
  masterGlobalAccessConfig:
    enabled: true
  masterIpv4CidrBlock: 192.168.0.0/28
  peeringName: gke-nf69df7b6242412e9932-582a-f600-peer
  privateEndpoint: 192.168.0.2
  publicEndpoint: REDACTED
releaseChannel:
  channel: REGULAR
resourceLabels:
  environment: uat
selfLink: REDACTED
servicesIpv4Cidr: 10.103.0.0/24
shieldedNodes:
  enabled: true
status: RUNNING
subnetwork: redacted
verticalPodAutoscaling:
  enabled: true
workloadIdentityConfig:
  workloadPool: REDACTED
zone: europe-west3

EDIT Adding pod describe logs.

kubectl describe pod core -n ingress-nginx

...
Events:
  Type     Reason     Age                        From     Message
  ----     ------     ----                       ----     -------
  Warning  Unhealthy  6m49s (x86815 over 3d22h)  kubelet  Readiness probe failed: HTTP probe failed with statuscode: 500
  Warning  BackOff    110s (x13994 over 3d23h)   kubelet  Back-off restarting failed container

kubectl describe pod queue -n ingress-nginx

...
 Events:
  Type     Reason             Age                        From                                   Message
  ----     ------             ----                       ----                                   -------
  Normal   NotTriggerScaleUp  9m29s (x18130 over 2d14h)  cluster-autoscaler                     pod didn't trigger scale-up (it wouldn't fit if a new node is added): 2 node(s) didn't match pod affinity rules, 3 node(s) had volume node affinity conflict
  Normal   NotTriggerScaleUp  4m28s (x24992 over 2d14h)  cluster-autoscaler                     pod didn't trigger scale-up (it wouldn't fit if a new node is added): 3 node(s) had volume node affinity conflict, 2 node(s) didn't match pod affinity rules
  Warning  FailedScheduling   3m33s (x3385 over 2d14h)   gke.io/optimize-utilization-scheduler  0/7 nodes are available: 1 node(s) had volume node affinity conflict, 6 Insufficient cpu.

GKE Autopilot Guardian is incompatible with cpu resource requests in chart

Answers (1)

Related Questions