Reputation: 340
I am looking for a way to delete PersistentVolumeClaims assigned to pods of a StatefulSet automatically when I scale down the number of instances. Is there a way to do this within k8s? I haven't found anything in the docs, yet.
Upvotes: 1
Views: 2672
Reputation: 1
Facing this problem, I ended up dropping the StatefulSet and implemented my own "custom controller" that, in essence, patches an ownerReference into each PersistentVolumeClaim, so the claim gets garbage-collected once the Pod that used it is gone.
This "custom controller" is implemented as a shell script residing in a ConfigMap and is kept running by a simple ReplicationController. Note that it needs permission to use kubectl.
It looks like this (shortened to the scope of this question):
---
# make one such map per runner
apiVersion: v1
kind: ConfigMap
metadata:
  name: runner01
  namespace: myrunners
  labels:
    myrunner: "true"
data:
  # whatever env runner01 needs
  RUNNER_UUID: "{...}"
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: scripts
  namespace: myrunners
data:
  # this is the "custom-controller"
  runner-controller.sh: |
    #!/bin/bash
    echo "pod $HOSTNAME started"
    #
    [ ! -d /etc/yaml ] && echo "error: /etc/yaml/ not existing" >/dev/stderr && exit 1
    #
    function runner {
      name=$1
      ns=$(grep 'namespace:' $name.yaml | head -1 | cut -d ':' -f2 | sed 's/\s//g')
      while true; do
        echo "--> starting pod $name in namespace=$ns..."
        kubectl apply -f PVCs.$name.yaml
        kubectl apply -f $name.yaml
        # Bind the runner's PersistentVolumeClaims to its Pod via
        # ownerReferences, so each PVC gets deleted when the Pod terminates.
        pod_uid=$(kubectl get pod $name -n $ns -o=jsonpath='{.metadata.uid}')
        PVCs=$(grep claimName $name.yaml | cut -d ':' -f 2 | sed 's/\s//g')
        for pvc in $PVCs; do
          kubectl patch pvc $pvc -n $ns --type='json' -p='[{
            "op": "add",
            "path": "/metadata/ownerReferences",
            "value": [{
              "apiVersion": "v1",
              "kind": "Pod",
              "name": "'"$name"'",
              "uid": "'"$pod_uid"'",
              "blockOwnerDeletion": true
            }]
          }]'
        done
        kubectl wait -n $ns --timeout=-1s --for=delete pod $name
        echo "$name pod got terminated, wait for its PVCs to be gone..."
        for pvc in $PVCs; do
          kubectl wait -n $ns --timeout=-1s --for=delete pvc $pvc &
        done
        wait
        echo "$name terminated"
      done
    }
    #
    function keep_runners_running {
      # note: pass all runner names as params; starts one runner job per
      # name, then restarts any runner whose process has exited
      echo "observing runners..."
      declare -A runner_for_pid
      for name in "$@"; do
        runner $name &
        runner_for_pid[$!]=$name
      done
      while true; do
        sleep 5
        for pid in "${!runner_for_pid[@]}"; do
          if ! kill -0 $pid 2>/dev/null; then
            name=${runner_for_pid[$pid]}
            unset runner_for_pid[$pid]
            echo "runner $name has exited, restarting it..."
            runner $name &
            runner_for_pid[$!]=$name
          fi
        done
      done
    }
    #
    # --- main
    cd /etc/yaml/
    RUNNERS=$(kubectl -n myrunners get configmap -l myrunner=true -o name | awk -F/ '{print $2}' ORS=' ')
    echo "found configMaps for runners: $RUNNERS"
    #
    trap 'echo "controller was asked to terminate, exiting..."; jobs -p | xargs -r kill; exit;' SIGINT SIGTERM
    #
    echo "starting runners..."
    keep_runners_running $RUNNERS &
    wait # forever
    #
---
# -- Runner Pods
# - each runner is a Pod with PersistentVolumeClaim(s)
# - provide one "runnerXX.yaml" + "PVCs.runnerXX.yaml" pair for each runner
apiVersion: v1
kind: ConfigMap
metadata:
  name: yaml
  namespace: myrunners
data:
  runner01.yaml: |
    ---
    apiVersion: v1
    kind: Pod
    metadata:
      name: runner01
      namespace: myrunners
      labels:
        app: myrunner
    spec:
      containers:
        - name: runner
          image: busybox
          command: ['/bin/sh', '-c', 'echo "I am runner $HOSTNAME"; sleep 300;']
          volumeMounts:
            - name: workdir
              mountPath: /var/tmp
      volumes:
        - name: workdir
          persistentVolumeClaim:
            claimName: runner01-workdir
  PVCs.runner01.yaml: |
    ---
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: runner01-workdir
      namespace: myrunners
      labels:
        app: myrunner
        runner: runner01
    spec:
      accessModes:
        - ReadWriteOnce
      storageClassName: directpv-min-io
      resources:
        requests:
          storage: 1Gi
---
# keep the "custom-controller" running all the time
apiVersion: v1
kind: ReplicationController
metadata:
  name: controller
  namespace: myrunners
  labels:
    app: controller
spec:
  replicas: 1
  selector:
    app: runner-controller.sh
  template:
    metadata:
      name: controller
      namespace: myrunners
      labels:
        app: runner-controller.sh
    spec:
      serviceAccount: mykubectl
      containers:
        - name: controller
          imagePullPolicy: IfNotPresent
          image: bitnami/kubectl
          command: ['/etc/scripts/runner-controller.sh']
          volumeMounts:
            - name: scripts
              mountPath: /etc/scripts
            - name: yaml
              mountPath: /etc/yaml
      volumes:
        - name: scripts
          configMap:
            name: scripts
            defaultMode: 0555
        - name: yaml
          configMap:
            name: yaml
            defaultMode: 0444
Why this makes sense: the PersistentVolume created via the claim resides on the local hard disk of the node the Pod gets placed on, which pins a particular runner to a particular node. The only way to avoid that is to remove the claim so the volume gets deleted, which leaves the next new Pod free to be scheduled on any node, as if it had never existed before. In essence, this is the opposite of a StatefulSet with regard to storage (the other pod controllers like Deployment or Job behave the same way when used with a local-disk volume manager).
It is only useful if the application is stateless with regard to its disk use.
Upvotes: 0
Reputation: 33231
I suspect that a preStop Lifecycle Handler could submit a Job to clean up the PVC, assuming the Pod's ServiceAccount had the Role to do so. Unfortunately, the Lifecycle Handler docs say that the exec blocks the Pod deletion, which is why whatever happens would need to be asynchronous from the Pod's perspective.
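As a sketch of that idea (the `cleanup` ServiceAccount and the `data-$HOSTNAME` claim name are hypothetical, and the ServiceAccount would need a Role allowing it to create Jobs and delete PVCs), the preStop hook could submit a Job and return immediately, so the deletion itself happens outside the Pod's termination path:

```yaml
# Hypothetical sketch: the preStop hook only submits a cleanup Job;
# kubectl create job returns as soon as the Job is accepted, so the
# PVC deletion runs asynchronously from the terminating Pod.
spec:
  serviceAccountName: cleanup   # assumed to have create-Job / delete-PVC rights
  containers:
    - name: app
      image: bitnami/kubectl
      lifecycle:
        preStop:
          exec:
            command:
              - /bin/sh
              - -c
              - >
                kubectl create job pvc-cleanup-$HOSTNAME
                --image=bitnami/kubectl
                -- kubectl delete pvc data-$HOSTNAME --wait=false
```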
Another approach might be to unconditionally scan the cluster or namespace with a CronJob and delete unassigned PVCs, or those that match a certain selector.
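A minimal sketch of that CronJob approach (the `pvc-gc` name, the schedule, and the `app=myrunner` selector are assumptions; this deletes by label only, since detecting truly "unassigned" PVCs would need extra logic):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: pvc-gc                       # hypothetical name
spec:
  schedule: "*/10 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: pvc-gc # assumed Role: list/delete PVCs
          restartPolicy: Never
          containers:
            - name: gc
              image: bitnami/kubectl
              command:
                - /bin/sh
                - -c
                # --ignore-not-found keeps the Job successful when
                # there is nothing left to clean up
                - kubectl delete pvc -l app=myrunner --ignore-not-found
```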
But I don't think there is any inherent ability to do that, given that (at least in my own usage) it's reasonable to scale a StatefulSet up and down, and when scaling it back up one would actually want the Pod to regain its identity in the StatefulSet, which typically includes any persisted data.
Upvotes: 3