Reputation: 4868
In my Kubernetes cluster I want to define a StatefulSet using a local persistent volume on each node. My Kubernetes cluster has three worker nodes. My StatefulSet looks something like this:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: myset
spec:
  replicas: 3
  ...
  template:
    spec:
      ....
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - myset
              topologyKey: kubernetes.io/hostname
      containers:
        ....
        volumeMounts:
        - name: datadir
          mountPath: /data
      volumes:
      - name: datadir
        persistentVolumeClaim:
          claimName: datadir
  podManagementPolicy: Parallel
  updateStrategy:
    type: RollingUpdate
  volumeClaimTemplates:
  - metadata:
      name: datadir
    spec:
      accessModes:
      - "ReadWriteOnce"
      storageClassName: "local-storage"
      resources:
        requests:
          storage: 10Gi
I want each Pod, running on a separate node, to use a local data volume.
I defined a StorageClass object:
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
and the following PersistentVolume:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: datadir
spec:
  capacity:
    storage: 10Gi
  volumeMode: Filesystem
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  local:
    path: /var/lib/my-data/
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - worker-node-1
But of course, this does not work, as I have defined a nodeAffinity with only the hostname of my first node, worker-node-1. As a result I can see only one PV. The PVC and the Pod on the corresponding node start as expected, but on the other two nodes I have no PVs. How can I define that a local PersistentVolume is created for each worker node?
I also tried to define a nodeAffinity with 3 values:
nodeAffinity:
  required:
    nodeSelectorTerms:
    - matchExpressions:
      - key: kubernetes.io/hostname
        operator: In
        values:
        - worker-node-1
        - worker-node-2
        - worker-node-3
But this also did not work.
Upvotes: 4
Views: 2969
Reputation: 11098
I fear that the PersistentVolume I define is the problem. This object will create exactly one PV, and so only one of my Pods finds the corresponding PV and can be scheduled.
Yes, you're right. By creating a PersistentVolume object, you create exactly one PersistentVolume. No more, no less. If you define 3 separate PVs, one available on each of your 3 nodes, you shouldn't experience any problem.
If you have, let's say, 3 worker nodes, you need to create 3 separate PersistentVolumes, each with a different nodeAffinity. You don't need to define any nodeAffinity in your StatefulSet, as it is already handled on the PersistentVolume level and should be defined only there.
As you can read in the local volume documentation:
Compared to hostPath volumes, local volumes are used in a durable and portable manner without manually scheduling pods to nodes. The system is aware of the volume's node constraints by looking at the node affinity on the PersistentVolume.
Remember: the PVC -> PV mapping is always 1:1. You cannot bind one PVC to 3 different PVs, or the other way around.
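For illustration only (this is a sketch built from the manifests in your question, not the full worked example further below): the three PVs can be identical apart from metadata.name and the hostname they are pinned to, e.g. for the second node:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: datadir-worker-node-2     # assumed name; any name unique per node works
spec:
  capacity:
    storage: 10Gi
  volumeMode: Filesystem
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  local:
    path: /var/lib/my-data/
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - worker-node-2         # besides the name, the only part that differs per node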
So my only solution is to switch from local PVs to hostPath volumes, which is working fine.
Yes, it can be done with hostPath, but I wouldn't say it is the only, or the best, solution. Local volumes have several advantages over hostPath volumes and are worth considering. But as I mentioned above, in your use case you need to create 3 separate PVs manually. You already created one PV, so it shouldn't be a big deal to create another two. This is the way to go.
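For completeness, if you did go the hostPath route you mentioned, the volumes fragment of your Pod template would look roughly like this (a sketch using the /var/lib/my-data path from your PV; the directory has to exist on every node, and you give up the node-affinity bookkeeping that local PVs provide):
volumes:
- name: datadir
  hostPath:
    path: /var/lib/my-data   # must already exist on each node
    type: Directory          # fail to start the Pod if the directory is missing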
I want each Pod, running on a separate node, to use a local data volume.
It can be achieved with local volumes, but in that case, instead of using a single PVC in your StatefulSet definition as in the following fragment from your configuration:
volumes:
- name: datadir
  persistentVolumeClaim:
    claimName: datadir
you need to use only volumeClaimTemplates, as in this example, which may look as follows:
volumeClaimTemplates:
- metadata:
    name: www
  spec:
    accessModes: [ "ReadWriteOnce" ]
    storageClassName: "my-storage-class"
    resources:
      requests:
        storage: 1Gi
As you can see, the PVCs won't "look" for a PV with any particular name, so you can name them as you wish. They will "look" for a PV belonging to a particular StorageClass and, in this particular case, supporting the "ReadWriteOnce" accessMode.
The scheduler will attempt to find an adequate node on which your stateful Pod can be scheduled. If another Pod was already scheduled on, let's say, worker-1, and the only PV on that node belonging to the local-storage storage class isn't available any more, the scheduler will try to find another node that meets the storage requirements. So again: no need for nodeAffinity / podAntiAffinity rules in your StatefulSet definition.
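For reference, the PVCs generated from a volumeClaimTemplates entry are named &lt;template name&gt;-&lt;statefulset name&gt;-&lt;ordinal&gt;, so with your datadir template and the myset StatefulSet, the claim for the first replica would look roughly like this (you don't create it yourself; it is shown here only to illustrate what the scheduler has to bind to a PV):
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: datadir-myset-0          # <template name>-<statefulset name>-<ordinal>
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: local-storage
  resources:
    requests:
      storage: 10Gi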
But I need some mechanism so that a PV is created for each node and assigned to the Pods created by the StatefulSet. But this did not work - I always have only one PV.
In order to facilitate the management of volumes and automate the whole process to a certain extent, take a look at the Local Persistence Volume Static Provisioner. As its name already suggests, it doesn't support dynamic provisioning (as we have, e.g., on various cloud platforms), which means you are still responsible for creating the underlying storage, but the whole volume lifecycle can be handled automatically.
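As a rough sketch of how that provisioner is driven (treat the keys, names and paths below as assumptions and check the project's documentation for the exact schema of your release): it watches a discovery directory on each node and creates a PV for every mount point it finds there, configured by a ConfigMap along these lines:
apiVersion: v1
kind: ConfigMap
metadata:
  name: local-provisioner-config   # hypothetical name
  namespace: kube-system
data:
  storageClassMap: |
    local-storage:                 # StorageClass assigned to the created PVs
      hostDir: /mnt/local-disks    # discovery directory on every node (assumed path)
      mountDir: /mnt/local-disks   # where that directory is mounted inside the provisioner pod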
To make this whole theoretical explanation somewhat more practical, I'm adding below a working example which you can quickly test for yourself. Make sure the /var/tmp/test directory is created on every node, or adjust the below examples to your needs:
StatefulSet components (a slightly modified example from here):
apiVersion: v1
kind: Service
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  ports:
  - port: 80
    name: web
  clusterIP: None
  selector:
    app: nginx
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  selector:
    matchLabels:
      app: nginx # has to match .spec.template.metadata.labels
  serviceName: "nginx"
  replicas: 3 # by default is 1
  template:
    metadata:
      labels:
        app: nginx # has to match .spec.selector.matchLabels
    spec:
      terminationGracePeriodSeconds: 10
      containers:
      - name: nginx
        image: k8s.gcr.io/nginx-slim:0.8
        ports:
        - containerPort: 80
          name: web
        volumeMounts:
        - name: www
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: www
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "local-storage"
      resources:
        requests:
          storage: 1Gi
StorageClass definition:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
And finally, a PV. You need to make 3 versions of the below YAML manifest by setting different names, e.g. example-pv-1, example-pv-2 and example-pv-3, and different node names.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: example-pv-1 ### 👈 change it
spec:
  capacity:
    storage: 10Gi
  volumeMode: Filesystem
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Delete
  storageClassName: local-storage
  local:
    path: /var/tmp/test ### 👈 you can adjust the shared directory on the node
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - worker-node-1 ### 👈 change this value by setting your node name
So: 3 different PVs for 3 worker nodes.
Upvotes: 8
Reputation: 81
Instead of using a nodeAffinity in the PV definition, I suggest using a podAntiAffinity rule in the StatefulSet definition to deploy your application, so that no two instances are located on the same host.
So you will have a StatefulSet definition similar to this:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: myset
spec:
  replicas: 3
  ...
  template:
    metadata:
      labels:
        sts: myset
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: sts
                operator: In
                values:
                - myset
            topologyKey: "kubernetes.io/hostname"
      containers:
        ....
        volumeMounts:
        - name: datadir
          mountPath: /data
      volumes:
      - name: datadir
        persistentVolumeClaim:
          claimName: datadir
Reference: An example of a pod that uses pod affinity
Upvotes: 0