alpe1

Reputation: 340

Auto delete PVC when scaling down?

I am looking for a way to automatically delete the PersistentVolumeClaims assigned to Pods of a StatefulSet when I scale down the number of instances. Is there a way to do this within k8s? I haven't found anything in the docs yet.
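
For example (names made up), a StatefulSet creating one PVC per replica from a volumeClaimTemplate:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: myapp
spec:
  serviceName: myapp
  replicas: 3
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: app
          image: busybox
          command: ['sh', '-c', 'sleep 3600']
          volumeMounts:
            - name: data
              mountPath: /var/tmp
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 1Gi

Scaling this down with kubectl scale statefulset myapp --replicas=1 removes the Pods myapp-2 and myapp-1, but the PVCs data-myapp-2 and data-myapp-1 are left behind.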

Upvotes: 1

Views: 2672

Answers (2)

rim

Reputation: 1

Facing this problem, I ended up dropping the StatefulSet and implementing my own "custom controller" that, in essence, patches an ownerReference into each PersistentVolumeClaim so it gets garbage collected once the Pod that used it is gone.
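
To illustrate the mechanism: after the patch, the PVC carries an ownerReference pointing at its Pod (the uid below is just a placeholder), so the garbage collector removes the PVC as soon as that Pod is deleted:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: runner01-workdir
  namespace: myrunners
  ownerReferences:
    - apiVersion: v1
      kind: Pod
      name: runner01
      uid: 00000000-0000-0000-0000-000000000000   # must be the live Pod's metadata.uid
      blockOwnerDeletion: true
spec:
  # ...unchanged...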

This "custom-controller" is implemented as shell script residing in a ConfigMap and is kept running by a simple ReplicationController. Note that it needs permission to use kubectl.

It looks like this (shortened to the scope of this question):

---
# make one such map per runner
apiVersion: v1
kind: ConfigMap
metadata:
  name: runner01
  namespace: myrunners
  labels:
    myrunner: "true"
data:
  # whatever env runner01 needs
  RUNNER_UUID: "{...}"
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: scripts
  namespace: myrunners
data:
  # this is the "custom-controller"
  runner-controller.sh: |
    #!/bin/bash
    echo "pod $HOSTNAME started"
    #
    [ ! -d /etc/yaml ] && echo "error: /etc/yaml/ does not exist" >&2 && exit 1
    #
    function runner {
       name=$1
       ns=$(grep 'namespace:' $name.yaml | head -1 | cut -d ':' -f2 | sed 's/\s//g')
       while true; do
          echo "--> starting pod $name in namespace=$ns..."
          kubectl apply -f PVCs.$name.yaml
          kubectl apply -f $name.yaml
          # Bind the runners PersistentVolumeClaims to its Pod via
          # ownerReferences, so each PVC gets deleted when the Pod terminates.
          pod_uid=$(kubectl get pod $name -n $ns  -o=jsonpath='{.metadata.uid}')
          PVCs=$(grep claimName $name.yaml | cut -d ':' -f 2 | sed 's/\s//g')
          for pvc in $PVCs; do
             kubectl patch pvc $pvc -n $ns --type='json' -p='[{
               "op": "add",
               "path": "/metadata/ownerReferences",
               "value": [{
                   "apiVersion": "v1",
                   "kind": "Pod",
                   "name": "'"$name"'",
                   "uid": "'"$pod_uid"'",
                   "blockOwnerDeletion": true
               }]
             }]'
          done
          kubectl wait -n $ns --timeout=-1s --for=delete pod $name
          echo "$name pod got terminated, wait for its PVCs to be gone..."
          for pvc in $PVCs; do
             kubectl wait -n $ns --timeout=-1s --for=delete pvc $pvc &
          done
          wait
          echo "$name terminated"
       done
    }
    #
    function keep_runners_running {
       # note: RUNNER_PIDS maps each background job PID to its runner name
       echo "observing runners..."
       while true; do
          sleep 5
          for pid in "${!RUNNER_PIDS[@]}"; do
             if ! kill -0 $pid 2>/dev/null; then
                name=${RUNNER_PIDS[$pid]}
                unset "RUNNER_PIDS[$pid]"
                echo "runner $name (pid $pid) has exited, restarting it..."
                runner $name &
                RUNNER_PIDS[$!]=$name
             fi
          done
       done
    }
    #
    # --- main
    cd /etc/yaml/
    RUNNERS=$(kubectl -n myrunners get configmap -l myrunner=true -o name | awk -F/ -v ORS=' ' '{print $2}')
    echo "found configMaps for runners: $RUNNERS"
    echo "starting runners..."
    declare -A RUNNER_PIDS   # maps background job PID -> runner name
    for name in $RUNNERS; do
       runner $name &   # have bash keep it as a background job
       RUNNER_PIDS[$!]=$name
    done
    #
    trap 'echo "controller was asked to terminate, exiting..."; jobs -p | xargs -r kill; exit;' SIGINT SIGTERM
    #
    keep_runners_running   # loops forever and keeps the container alive
    #
---
# -- Runner Pods
# - each runner is a Pod with PersistentVolumeClaim(s)
# - provide one "runnerXX.yaml" + "PVCs.runnerXX.yaml" pair for each runner
apiVersion: v1
kind: ConfigMap
metadata:
  name: yaml
  namespace: myrunners
data:
  runner01.yaml: |
    ---
    apiVersion: v1
    kind: Pod
    metadata:
      name: runner01
      namespace: myrunners
      labels:
        app: myrunner
    spec:
      containers:
        - name: runner
          image: busybox
          command: ['/bin/sh', '-c', 'echo "I am runner $HOSTNAME"; sleep 300;']
          volumeMounts:
            - name: workdir
              mountPath: /var/tmp
      volumes:
        - name: workdir
          persistentVolumeClaim:
            claimName: runner01-workdir
    ---
  PVCs.runner01.yaml: |
    ---
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: runner01-workdir
      namespace: myrunners
      labels:
        app: myrunner
        runner: runner01
    spec:
      accessModes:
        - ReadWriteOnce
      storageClassName: directpv-min-io
      resources:
        requests:
          storage: 1Gi
    ---
---
# have the "customer-controller" be running all the time
apiVersion: v1
kind: ReplicationController
metadata:
  name: controller
  namespace: myrunners
  labels:
    app: controller
spec:
  replicas: 1
  selector:
    app: runner-controller.sh
  template:
    metadata:
      name: controller
      namespace: myrunners
      labels:
        app: runner-controller.sh
    spec:
      serviceAccountName: mykubectl
      containers:
        - name: controller
          imagePullPolicy: IfNotPresent
          image: bitnami/kubectl
          command: ['/etc/scripts/runner-controller.sh']
          volumeMounts:
            - name: scripts
              mountPath: /etc/scripts
            - name: yaml
              mountPath: /etc/yaml
      volumes:
        - name: scripts
          configMap:
            name: scripts
            defaultMode: 0555
        - name: yaml
          configMap:
            name: yaml
            defaultMode: 0444
---
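
The mykubectl ServiceAccount used by the ReplicationController above is not shown; a minimal sketch of the RBAC it needs could look like this (the exact resources and verbs are an assumption, trim them to what the script actually does in your setup):

---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: mykubectl
  namespace: myrunners
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: runner-controller
  namespace: myrunners
rules:
  - apiGroups: [""]
    resources: ["pods", "persistentvolumeclaims", "configmaps"]
    verbs: ["get", "list", "watch", "create", "delete", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: runner-controller
  namespace: myrunners
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: runner-controller
subjects:
  - kind: ServiceAccount
    name: mykubectl
    namespace: myrunners
---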

Why that makes sense:

The PersistentVolume created here via the claim resides on the local hard disk of the node the Pod gets placed on. This makes a particular runner sticky to a particular node. The only way to avoid that is to remove the claim, so the volume gets deleted, which leaves the next new Pod free to be placed on any node as if it had never existed before. In essence this is the opposite of a StatefulSet with regard to storage (the other pod controllers like Deployment or Job behave the same way when used with a local-disk volume manager).

This is only useful if the application is stateless regarding its disk use.

Upvotes: 0

mdaniel

Reputation: 33231

I suspect that a preStop lifecycle handler could submit a Job to clean up the PVC, assuming the Pod's ServiceAccount had the Role to do so. Unfortunately, the lifecycle handler docs say that the exec blocks the Pod deletion, which is why whatever it does would need to be asynchronous from the Pod's perspective.
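
A rough sketch of that idea (the Job manifest path, image, and ServiceAccount are all hypothetical); the preStop handler only submits the cleanup Job and returns, so the actual PVC deletion happens asynchronously:

apiVersion: v1
kind: Pod
metadata:
  name: worker-0
spec:
  serviceAccountName: pvc-cleaner
  containers:
    - name: worker
      image: bitnami/kubectl
      command: ['sh', '-c', 'sleep 3600']
      lifecycle:
        preStop:
          exec:
            # create the Job and exit immediately; the Job deletes the PVC
            # on its own, outside the terminating Pod's lifecycle
            command: ['sh', '-c', 'kubectl create -f /opt/cleanup-pvc-job.yaml || true']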

Another approach might be to unconditionally scan the cluster or namespace with a CronJob and delete unassigned PVCs, or those that match a certain selector.
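
A sketch of that, assuming the disposable PVCs carry a label like reap=true and a ServiceAccount with the relevant get/list/delete permissions exists (all names here are made up):

apiVersion: batch/v1
kind: CronJob
metadata:
  name: pvc-reaper
  namespace: myapp
spec:
  schedule: '*/10 * * * *'
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: pvc-cleaner
          restartPolicy: Never
          containers:
            - name: reaper
              image: bitnami/kubectl
              command:
                - /bin/sh
                - -c
                - |
                  # delete every labelled PVC that no Pod currently references
                  used=$(kubectl get pods -n myapp -o jsonpath='{.items[*].spec.volumes[*].persistentVolumeClaim.claimName}' | tr ' ' '\n')
                  for pvc in $(kubectl get pvc -n myapp -l reap=true -o name); do
                    name=${pvc#persistentvolumeclaim/}
                    echo "$used" | grep -qx "$name" || kubectl delete -n myapp "$pvc"
                  done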

But I don't think there is any inherent ability to do that, given that (at least in my own usage) it's reasonable to scale a StatefulSet up and down, and when scaling it back up one would actually want the Pod to regain its identity in the StatefulSet, which typically includes any persisted data.

Upvotes: 3
