Reputation: 134
I have a Kubernetes job (with parallelism: 50) running on a GKE Autopilot cluster that needs more storage than the maximum ephemeral storage Autopilot provisions per node (i.e. 10Gi). Since I need ReadWriteMany
access to the storage across pods, I decided on GCP Filestore (though it would've been nice if the minimum Filestore instance size were less than 1 TiB) to create a PVC that can be mounted on the job pods. However, the job pods are stuck in the ContainerCreating state, and from the event logs, a MountVolume.MountDevice failure seems to be the reason:
Warning FailedScheduling 11m gke.io/optimize-utilization-scheduler 0/12 nodes are available: 11 Insufficient memory, 12 Insufficient cpu. preemption: 0/12 nodes are available: 12 No preemption victims found for incoming pod..
Normal TriggeredScaleUp 11m cluster-autoscaler pod triggered scale-up
Normal Scheduled 6m39s gke.io/optimize-utilization-scheduler Successfully assigned default/mypod-7l5k9 to gk3-mycluster-3-e79620bd-jvsg
Warning FailedMount 4m8s (x6 over 4m39s) kubelet MountVolume.MountDevice failed for volume "pvc-435bf565-25f0-43f7-86d4-b3ecadce43a3" : rpc error: code = Aborted desc = An operation with the given volume key modeInstance/asia-northeast1-b/pvc-435bf565-25f0-43f7-86d4-b3ecadce43a3/vol1 already exists.
--- Most likely a long process is still running to completion. Retrying.
Warning FailedMount 2m19s kubelet Unable to attach or mount volumes: unmounted volumes=[my-mounted-storage], unattached volumes=[kube-api-access-4gs6h shared-storage]: timed out waiting for the condition
Warning FailedMount 96s (x2 over 4m39s) kubelet MountVolume.MountDevice failed for volume "pvc-435bf565-25f0-43f7-86d4-b3ecadce43a3" : rpc error: code = DeadlineExceeded desc = context deadline exceeded
Warning FailedMount 5s (x2 over 4m36s) kubelet Unable to attach or mount volumes: unmounted volumes=[my-mounted-storage], unattached volumes=[my-mounted-storage kube-api-access-4gs6h]: timed out waiting for the condition
Here are my PVC and Job manifests:
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: podpvc
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: standard-rwx
  resources:
    requests:
      storage: 1Ti
---
apiVersion: batch/v1
kind: Job
metadata:
  name: mypod
  labels:
    app.kubernetes.io/name: mypod
spec:
  parallelism: 50
  template:
    metadata:
      name: mypod
    spec:
      serviceAccountName: workload-identity-sa
      volumes:
        - name: my-mounted-storage
          persistentVolumeClaim:
            claimName: podpvc
      containers:
        - name: mypod-container
          image: mypod-image:staging-0.1
          imagePullPolicy: Always
          env:
            - name: env
              value: "stg"
          resources:
            requests:
              cpu: "4"
              memory: "16Gi"
          volumeMounts:
            - name: my-mounted-storage
              mountPath: /mnt/data
      restartPolicy: OnFailure
Both the PV and PVC seem to be healthy and bound, and there don't appear to be any existing volume attachments on the nodes (kubectl describe nodes | grep Attach). I've also tried deleting both the PVC and the job and recreating them, but the issue persists.
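For reference, the binding checks look roughly like this (a sketch; the PVC and pod names are taken from the manifests and events above):
# PVC should be Bound and show the generated PV name
kubectl get pvc podpvc
# PV status and the StorageClass it was provisioned from
kubectl get pv
# Mount-related events for one of the stuck pods
kubectl describe pod mypod-7l5k9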
Upvotes: 0
Views: 1111
Reputation: 3256
The following checkpoints can help you resolve your issue:
1. Check whether the Filestore instance is in the default network:
Check whether the GKE cluster was created in a non-default network while you are using the GKE-supplied storage classes (standard-rwx, enterprise-rwx, premium-rwx); the cluster's network is shown in its networking section. These storage classes do not set a network parameter, so the Filestore instance is provisioned in the default network, and the mount fails because a Filestore instance in the default network cannot be mounted on nodes in a non-default network.
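One quick way to compare the two networks (a sketch; CLUSTER_NAME is a placeholder, the region is inferred from the zone in the logs, and the instance name comes from the error message in the question):
# Network the GKE cluster is attached to
gcloud container clusters describe CLUSTER_NAME \
    --region asia-northeast1 --format='value(network)'
# Network the auto-provisioned Filestore instance landed in
gcloud filestore instances describe pvc-435bf565-25f0-43f7-86d4-b3ecadce43a3 \
    --zone asia-northeast1-b --format='value(networks[0].network)'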
To resolve this issue, specify a network parameter for the Filestore instance that matches the network of the GKE cluster by adding the storageclass.parameters.network field as follows:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: filestore-example
provisioner: filestore.csi.storage.gke.io
volumeBindingMode: Immediate
allowVolumeExpansion: true
parameters:
  tier: standard
  network: default
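The PVC would then reference this custom class instead of the built-in one; a sketch based on the PVC from the question, with only storageClassName changed:
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: podpvc
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: filestore-example
  resources:
    requests:
      storage: 1Ti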
2. Check the IP addresses:
Check whether the IP address of the Filestore instance differs from the IP address recorded in the provisioned PersistentVolume; the PV should contain the Filestore instance's IP address and name. If they differ, try editing the PV YAML and setting the correct IP address.
For more information, follow this document.
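A quick way to compare the two (a sketch; the PV name and zone are taken from the events in the question, and it assumes the Filestore CSI driver records the instance IP under volumeAttributes.ip):
# IP address recorded in the PersistentVolume (assumed volumeAttributes.ip field)
kubectl get pv pvc-435bf565-25f0-43f7-86d4-b3ecadce43a3 \
    -o jsonpath='{.spec.csi.volumeAttributes.ip}'
# IP address of the Filestore instance itself
gcloud filestore instances describe pvc-435bf565-25f0-43f7-86d4-b3ecadce43a3 \
    --zone asia-northeast1-b --format='value(networks[0].ipAddresses[0])'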
Upvotes: 3