user19238163

Why don't my Postgres Operator pods move to a new node when one node fails?

I have a Kubernetes cluster with one master and 3 worker nodes. I have set up the Crunchy Postgres Operator for high availability with 2 replicas. This is my deployment file.

apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  name: hippo-ha
spec:
  service:
    type: LoadBalancer
  patroni:
    dynamicConfiguration:
      synchronous_mode: true
      postgresql:
        parameters:
          synchronous_commit: "on"
  image: registry.developers.crunchydata.com/crunchydata/crunchy-postgres:ubi8-14.6-2
  postgresVersion: 14
  instances:
    - name: pgha1
      replicas: 2
      dataVolumeClaimSpec:
        accessModes:
        - "ReadWriteOnce"
        resources:
          requests:
            storage: 1Gi
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - topologyKey: kubernetes.io/hostname
            labelSelector:
              matchLabels:
                postgres-operator.crunchydata.com/cluster: hippo-ha
                postgres-operator.crunchydata.com/instance-set: pgha1
          #- weight: 1
            #podAffinityTerm:
              #topologyKey: kubernetes.io/hostname
              #labelSelector:
                #matchLabels:
                  #postgres-operator.crunchydata.com/cluster: hippo-ha
                  #postgres-operator.crunchydata.com/instance-set: pgha1
  
  monitoring:
    pgmonitor:
      exporter:
        image: registry.developers.crunchydata.com/crunchydata/crunchy-postgres-exporter:ubi8-5.3.0-0
  backups:
    pgbackrest:
      image: registry.developers.crunchydata.com/crunchydata/crunchy-pgbackrest:ubi8-2.41-2
      repos:
      - name: repo1
        volume:
          volumeClaimSpec:
            accessModes:
            - "ReadWriteOnce"
            resources:
              requests:
                storage: 1Gi
  proxy:
    pgBouncer:
      image: registry.developers.crunchydata.com/crunchydata/crunchy-pgbouncer:ubi8-1.17-5
      replicas: 2
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 1
            podAffinityTerm:
              topologyKey: kubernetes.io/hostname
              labelSelector:
                matchLabels:
                  postgres-operator.crunchydata.com/cluster: hippo-ha
                  postgres-operator.crunchydata.com/role: pgbouncer

This deploys the pods on 2 different nodes as expected. One is the primary pod and the other is a replica pod. So far so good.

 masterk8s@-machine:~/postgres-operator-examples-3$ kubectl get pods
NAME                                    READY   STATUS      RESTARTS       AGE
crunchy-alertmanager-5cd75b4f75-m6k5l   1/1     Running     0              79m
crunchy-grafana-64b9f9dcc-kl9f7         1/1     Running     1 (74m ago)    79m
crunchy-prometheus-dc4cbff87-hspst      0/1     Running     1 (74m ago)    79m
hippo-ha-backup-478f-svf6j              0/1     Completed   0              92m
hippo-ha-pgbouncer-7b5f679db4-glj7s     2/2     Running     2 (106m ago)   142m
hippo-ha-pgbouncer-7b5f679db4-z74zx     2/2     Running     0              142m
hippo-ha-pgha1-5v9l-0                   5/5     Running     0              18m
hippo-ha-pgha1-ltb2-0                   5/5     Running     0              63m
hippo-ha-repo-host-0                    2/2     Running     4 (62m ago)    142m
pgo-7c867985c-cwbgp                     1/1     Running     0              152m
pgo-upgrade-69b5dfdc45-xjdxt            1/1     Running     0              152m
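
To confirm which node each pod landed on, the same listing can be requested with the wide output (command only; output omitted here):

    kubectl get pods -o wide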

What Problem I faced: I checked which pod was the Postgres primary, and it was running on worker-node3. I shut down worker-node3 to test the failover.
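
For context, the current primary can be identified by the role label the operator sets on it (label names as used in the Crunchy v5 documentation; adjust if your version differs):

    kubectl get pods -o wide -l postgres-operator.crunchydata.com/cluster=hippo-ha,postgres-operator.crunchydata.com/role=master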

Result: All pods on worker-node3 are stuck in the Terminating state.
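
Roughly, this is what the cluster looks like after the shutdown (commands only; worker-node3 presumably shows as NotReady while its pods stay stuck in Terminating):

    kubectl get nodes
    kubectl get pods -o wide | grep Terminating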

What I am expecting: The pods from worker-node3 should be rescheduled onto the other 2 available nodes.

Problem: Because the pods are stuck in the Terminating state, I am not able to make any connection to the database, and data cannot be fetched or written. The high availability test completely fails.
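
For example, a connectivity check like the following (service name, user, and database assumed from the operator's default naming for this cluster) hangs or fails while the pods are stuck:

    kubectl port-forward svc/hippo-ha-pgbouncer 5432:5432 &
    psql -h 127.0.0.1 -p 5432 -U hippo-ha hippo-ha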

What I have done: I tried both preferredDuringSchedulingIgnoredDuringExecution and requiredDuringSchedulingIgnoredDuringExecution, as shown in the commented-out code above. In both cases, the pods are stuck in the Terminating state and I am not able to access the database.
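
For clarity, the preferred variant I refer to is the commented-out block above, i.e. roughly:

      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 1
            podAffinityTerm:
              topologyKey: kubernetes.io/hostname
              labelSelector:
                matchLabels:
                  postgres-operator.crunchydata.com/cluster: hippo-ha
                  postgres-operator.crunchydata.com/instance-set: pgha1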

I am sure I missed something, but I am not able to find the mistake. Can you please help me find the issue? Why are the pods not being rescheduled to recreate the required number of replicas? It would be a great help. Thanks.

Upvotes: 0

Views: 523

Answers (0)
