Ralph

Reputation: 4868

How to define local persistent volumes in a Kubernetes StatefulSet?

In my Kubernetes cluster I want to define a StatefulSet using a local persistent volume on each node. My Kubernetes cluster has three worker nodes.

My StatefulSet looks something like this:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: myset
spec:
  replicas: 3
  ...
  template:
    spec:
     ....
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - myset
              topologyKey: kubernetes.io/hostname
      containers:
     ....
        volumeMounts:
        - name: datadir
          mountPath: /data
      volumes:
      - name: datadir
        persistentVolumeClaim:
          claimName: datadir
  podManagementPolicy: Parallel
  updateStrategy:
    type: RollingUpdate
  volumeClaimTemplates:
  - metadata:
      name: datadir
    spec:
      accessModes:
        - "ReadWriteOnce"
      storageClassName: "local-storage"
      resources:
        requests:
          storage: 10Gi

I want to achieve that each pod, running on a separate node, uses its own local data volume.

I defined a StorageClass object:

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer

and the following PersistentVolume:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: datadir
spec:
  capacity:
    storage: 10Gi
  volumeMode: Filesystem
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  local:
    path: /var/lib/my-data/
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - worker-node-1

But of course, this did not work, as I defined the nodeAffinity with only the hostname of my first node, worker-node-1. As a result I can see only one PV. The PVC and the pod on the corresponding node start as expected, but on the other two nodes there are no PVs. How can I define that a local PersistentVolume is created for each worker node?

I also tried to define a nodeAffinity with 3 values:

  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - worker-node-1
          - worker-node-2
          - worker-node-3

But this also did not work.
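
In both cases, this is how I check the current state of the volumes and pods (there is still only one PV):

kubectl get pv
kubectl get pvc
kubectl get pods -o wide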

Upvotes: 4

Views: 2969

Answers (2)

mario

Reputation: 11098

I fear that the PersistentVolume I define is the problem. This object will create exactly one PV, and so only one of my pods finds the corresponding PV and can be scheduled.

Yes, you're right. By creating a PersistentVolume object, you create exactly one PersistentVolume. No more, no less. If you define 3 separate PVs, one available on each of your 3 nodes, you shouldn't experience any problems.

If you have, let's say, 3 worker nodes, you need to create 3 separate PersistentVolumes, each one with a different nodeAffinity. You don't need to define any node affinity in your StatefulSet, as it is already handled at the PersistentVolume level and should be defined only there.
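
To get the exact hostname values for each PV's nodeAffinity, you can list the kubernetes.io/hostname label of your nodes, e.g.:

kubectl get nodes -L kubernetes.io/hostname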

As you can read in the local volume documentation:

Compared to hostPath volumes, local volumes are used in a durable and portable manner without manually scheduling pods to nodes. The system is aware of the volume's node constraints by looking at the node affinity on the PersistentVolume.

Remember: the PVC -> PV mapping is always 1:1. You cannot bind one PVC to 3 different PVs, or the other way around.

So my only solution is to switch from local PVs to hostPath volumes, which is working fine.

Yes, it can be done with hostPath, but I wouldn't say it is the only or the best solution. Local volumes have several advantages over hostPath volumes and are worth considering. But as I mentioned above, in your use case you need to create 3 separate PVs manually. You have already created one PV, so it shouldn't be a big deal to create another two. This is the way to go.
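
If writing the two remaining manifests by hand feels tedious, a small shell loop can stamp them out from a template. This is just a sketch: it assumes your nodes are really named worker-node-1 to worker-node-3 and that you keep your PV spec in a pv-template.yaml (an illustrative file name) with NAME and NODE placeholders:

# Generate and apply one local PV per worker node from a template
# in which metadata.name is "NAME" and the hostname value is "NODE".
for i in 1 2 3; do
  sed -e "s/NAME/datadir-worker-node-$i/" \
      -e "s/NODE/worker-node-$i/" \
      pv-template.yaml | kubectl apply -f -
done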

I want to achieve that each pod, running on a separate node, uses its own local data volume.

It can be achieved with local volumes but in such case instead of using a single PVC in your StatefulSet definition as in the below fragment from your configuration:

  volumes:
  - name: datadir
    persistentVolumeClaim:
      claimName: datadir

you need to use only volumeClaimTemplates as in this example, which may look as follows:

  volumeClaimTemplates:
  - metadata:
      name: www
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "my-storage-class"
      resources:
        requests:
          storage: 1Gi

As you can see, the PVCs won't "look" for a PV with any particular name, so you can name them as you wish. They will "look" for a PV belonging to the requested StorageClass and, in this particular case, supporting the ReadWriteOnce access mode.
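
For each replica, the volumeClaimTemplates entry produces a dedicated PVC named <template-name>-<statefulset-name>-<ordinal> (so www-web-0, www-web-1 and www-web-2 in the full example below), which you can inspect with:

kubectl get pvc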

The scheduler will attempt to find an adequate node on which your stateful pod can be scheduled. If another pod has already been scheduled on, let's say, worker-1 and the only PV on that node belonging to our local-storage storage class isn't available any more, the scheduler will try to find another node that meets the storage requirements. So again: no need for node affinity / pod anti-affinity rules in your StatefulSet definition.
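
Once the pods are up, you can easily see on which node each replica landed and which PV its claim was bound to:

kubectl get pods -o wide   # the NODE column shows where each replica runs
kubectl get pvc            # the VOLUME column shows the bound PV
kubectl get pv             # the STATUS and CLAIM columns show the binding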

But I need some mechanism so that a PV is created for each node and bound to the pods created by the StatefulSet. But this did not work - I always have only one PV.

In order to facilitate the management of volumes and automate the whole process to a certain extent, take a look at the Local Persistence Volume Static Provisioner. As its name already suggests, it doesn't support dynamic provisioning (as we have e.g. on various cloud platforms), which means you are still responsible for creating the underlying storage, but the whole volume lifecycle can be handled automatically.
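
Just to give you a rough idea of how the static provisioner works (the names and the discovery directory below are purely illustrative; check the project's documentation for the exact values): it watches a configured discovery directory on every node and creates a PV for each volume mounted there, driven by a ConfigMap along these lines:

apiVersion: v1
kind: ConfigMap
metadata:
  name: local-provisioner-config   # illustrative name
  namespace: kube-system
data:
  storageClassMap: |
    local-storage:
      hostDir: /mnt/local-disks    # discovery directory on the node (adjust to your setup)
      mountDir: /mnt/local-disks   # the same directory as seen from inside the provisioner pod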

To make this whole theoretical explanation somewhat more practical, I'm adding below a working example, which you can quickly test for yourself. Make sure the /var/tmp/test directory is created on every node, or adjust the examples below to your needs:
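
For a quick test you can create that directory on each worker by hand. Assuming you can SSH into the nodes (node names taken from your question), something like:

for node in worker-node-1 worker-node-2 worker-node-3; do
  ssh "$node" 'sudo mkdir -p /var/tmp/test'
done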

StatefulSet components (slightly modified example from here):

apiVersion: v1
kind: Service
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  ports:
  - port: 80
    name: web
  clusterIP: None
  selector:
    app: nginx
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  selector:
    matchLabels:
      app: nginx # has to match .spec.template.metadata.labels
  serviceName: "nginx"
  replicas: 3 # by default is 1
  template:
    metadata:
      labels:
        app: nginx # has to match .spec.selector.matchLabels
    spec:
      terminationGracePeriodSeconds: 10
      containers:
      - name: nginx
        image: k8s.gcr.io/nginx-slim:0.8
        ports:
        - containerPort: 80
          name: web
        volumeMounts:
        - name: www
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: www
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "local-storage"
      resources:
        requests:
          storage: 1Gi

StorageClass definition:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer

And finally a PV. You need to create 3 versions of the below YAML manifest, with different names (e.g. example-pv-1, example-pv-2 and example-pv-3) and different node names.

apiVersion: v1
kind: PersistentVolume
metadata:
  name: example-pv-1 ### 👈 change it
spec:
  capacity:
    storage: 10Gi
  volumeMode: Filesystem
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Delete
  storageClassName: local-storage
  local:
    path: /var/tmp/test ### 👈 you can adjust shared directory on the node 
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - worker-node-1 ### 👈 change this value by setting your node name

So 3 different PVs for 3 worker nodes.
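
Once you apply the StorageClass, the three PVs and the StatefulSet (file names below are illustrative), every claim should get bound to the PV of the node its pod runs on:

kubectl apply -f storage-class.yaml
kubectl apply -f example-pv-1.yaml -f example-pv-2.yaml -f example-pv-3.yaml
kubectl apply -f statefulset.yaml

kubectl get pv             # all three PVs should eventually become Bound
kubectl get pods -o wide   # expect one replica per worker node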

Upvotes: 8

Yiadh TLIJANI

Reputation: 81

Instead of using a nodeAffinity in the PV definition, I suggest using a podAntiAffinity rule in the StatefulSet definition to deploy your application, so that no two instances are located on the same host.

So you will have a StatefulSet definition similar to this:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: myset
spec:
  replicas: 3
  ...
  template:
    metadata:
      labels:
        sts: myset
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: sts
                operator: In
                values:
                - myset
            topologyKey: "kubernetes.io/hostname"
      containers:
     ....
        volumeMounts:
        - name: datadir
          mountPath: /data
      volumes:
      - name: datadir
        persistentVolumeClaim:
          claimName: datadir
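
Note that with a required anti-affinity rule and the kubernetes.io/hostname topology key, a replica that cannot be placed on a separate node (e.g. a 4th replica on a 3-node cluster) will stay Pending. You can confirm the spread with:

kubectl get pods -l sts=myset -o wide   # each pod should show a different NODE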

Reference: An example of a pod that uses pod affinity

Upvotes: 0
