Justin Lloyd

Reputation: 123

GKE Kubernetes Job unable to use ephemeral volume in Autopilot Cluster

I am trying to set up a Job on a GKE Autopilot cluster.

The Job restores database backups, so it needs to download and decompress very large files (around 50 to 100Gi).

However, Autopilot Pods have an ephemeral storage limit of 10Gi, so I followed this guide to use a generic ephemeral volume instead:

https://cloud.google.com/kubernetes-engine/docs/how-to/generic-ephemeral-volumes

I have confirmed that the volume is indeed available to the pod by using the command:

kubectl exec -it deploy/ephemeral-deployment -- bash

So the volume is being created, mounted, and available to the Job, giving it the 100Gi of space it needs. Despite this, the Job keeps failing and I get the error message:

Pod ephemeral local storage usage exceeds the total limit of containers 1Gi.
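
The Job manifest itself is not shown in this post. As a rough sketch only, a Job set up along the lines of the linked guide might look like the following. The Job name, image, and mount path are placeholders, and the `standard-rwo` storage class is an assumption about the cluster:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: restore-job                        # hypothetical name
spec:
  backoffLimit: 0                          # assumption: do not retry a failed restore
  template:
    spec:
      restartPolicy: Never                 # Jobs require Never or OnFailure
      containers:
      - name: restore
        image: example.com/db-restore:latest   # placeholder image
        volumeMounts:
        - name: scratch-volume
          mountPath: /scratch              # placeholder path for downloads
      volumes:
      - name: scratch-volume
        ephemeral:                         # generic ephemeral volume
          volumeClaimTemplate:
            spec:
              accessModes: ["ReadWriteOnce"]
              storageClassName: standard-rwo   # assumed storage class
              resources:
                requests:
                  storage: 100Gi
```

In a layout like this, only files written under the mount path land on the 100Gi volume. Anything written elsewhere in the container filesystem, such as a temp directory, still counts against the container's ephemeral-storage limit.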

I did some research and found that it is due to the resource limits set in the YAML file:

resources:
  limits:
    cpu: "5"
    ephemeral-storage: 1Gi   # <-- the limit named in the error
    memory: 6Gi
  requests:
    cpu: "5"
    ephemeral-storage: 1Gi
    memory: 6Gi

The problem is, I can't remove the limits. If I create the Job without them in the YAML, they are added automatically. If I increase them, they are reset to the 10Gi limit.

Either way, I can't use the 100Gi I have set up on the ephemeral volume. It's almost like it's fighting itself.

Is there any way around this?

Upvotes: 0

Views: 442

Answers (1)

x-zone-cat

Reputation: 550

This is a new capability on GKE Autopilot clusters; you can read about it in this article.

Things to consider in order to use more ephemeral storage on GKE Autopilot:

  • Upgrade the cluster to version 1.28.6-gke.1095000 or later.

  • Use the Performance compute class with a machine family such as C3 or C3D.

  • Use the sample YAML below as a reference:

apiVersion: v1
kind: Pod
metadata:
  name: performance-pod
spec:
  nodeSelector:
    cloud.google.com/compute-class: Performance
    cloud.google.com/machine-family: c3d
  containers:
  - name: my-container
    image: "k8s.gcr.io/pause"
    resources:
      requests:
        cpu: 4
        memory: "16Gi"
        ephemeral-storage: 100Gi
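
Since the question is about a Job rather than a bare Pod, the same node selector and storage request could probably be carried over into a Job's Pod template. This is an untested sketch: the Job name and image are placeholders, and `backoffLimit: 0` is just an assumption about how a failed restore should be handled:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: restore-job                        # hypothetical name
spec:
  backoffLimit: 0                          # assumption: do not retry a failed restore
  template:
    spec:
      restartPolicy: Never                 # Jobs require Never or OnFailure
      nodeSelector:
        cloud.google.com/compute-class: Performance
        cloud.google.com/machine-family: c3d
      containers:
      - name: restore
        image: example.com/db-restore:latest   # placeholder image
        resources:
          requests:
            cpu: 4
            memory: "16Gi"
            ephemeral-storage: 100Gi       # same request as the Pod example
```

The CPU and memory values are copied from the Pod example and may need adjusting for the actual restore workload.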

Upvotes: 0
