Zalando postgres-operator failed to synchronize database after deployment

Question

I installed the Zalando postgres-operator (https://github.com/zalando/postgres-operator) on my Kubernetes cluster and tried to deploy the PostgreSQL sample "minimal-postgres-manifest.yaml" provided in that Git project.

After the 2 Spilo pods are deployed, I cannot access the databases (psql fails with a connection timeout).
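A connection timeout (as opposed to an authentication error) suggests the TCP connection itself never completes, so one way to narrow this down is a plain TCP probe of the Postgres port from different places in the cluster. The following is only a minimal sketch: the helper name `can_connect` is hypothetical, the default port 5432 is the standard Postgres port, and it assumes Python is available wherever the probe is run.

```python
import socket


def can_connect(host: str, port: int = 5432, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    Hypothetical helper for illustration; it only tests TCP reachability,
    not Postgres authentication or whether the database accepts queries.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and unreachable hosts.
        return False


# Example (not executed here): probe the master pod IP from the pod listing.
# can_connect("10.244.2.13")
```

If the probe succeeds from inside the Spilo pod but fails from another pod or node, the problem is more likely in pod-to-pod networking than in Postgres itself.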

When I check the logs of the postgres-operator pod, I can see that it fails to connect to the database and cannot initialize it:

time="2022-04-15T06:18:46Z" level=debug msg="syncing master service" cluster-name=default/acid-minimal-cluster pkg=cluster worker=1
time="2022-04-15T06:18:46Z" level=debug msg="syncing replica service" cluster-name=default/acid-minimal-cluster pkg=cluster worker=1
time="2022-04-15T06:18:46Z" level=debug msg="No load balancer created for the replica service" cluster-name=default/acid-minimal-cluster pkg=cluster worker=1
time="2022-04-15T06:18:46Z" level=debug msg="syncing volumes using \"pvc\" storage resize mode" cluster-name=default/acid-minimal-cluster pkg=cluster worker=1
time="2022-04-15T06:18:46Z" level=info msg="volume claims do not require changes" cluster-name=default/acid-minimal-cluster pkg=cluster worker=1
time="2022-04-15T06:18:46Z" level=debug msg="syncing statefulsets" cluster-name=default/acid-minimal-cluster pkg=cluster worker=1
time="2022-04-15T06:18:47Z" level=debug msg="making GET http request: http://10.244.2.13:8008/config" cluster-name=default/acid-minimal-cluster pkg=cluster worker=1
time="2022-04-15T06:18:59Z" level=debug msg="making GET http request: http://10.244.1.14:8008/patroni" cluster-name=default/acid-minimal-cluster pkg=cluster worker=1
time="2022-04-15T06:18:59Z" level=debug msg="making GET http request: http://10.244.2.13:8008/patroni" cluster-name=default/acid-minimal-cluster pkg=cluster worker=1
time="2022-04-15T06:18:59Z" level=debug msg="syncing pod disruption budgets" cluster-name=default/acid-minimal-cluster pkg=cluster worker=1
W0415 06:18:59.347588       1 warnings.go:70] policy/v1beta1 PodDisruptionBudget is deprecated in v1.21+, unavailable in v1.25+; use policy/v1 PodDisruptionBudget
time="2022-04-15T06:18:59Z" level=debug msg="syncing roles" cluster-name=default/acid-minimal-cluster pkg=cluster worker=1
time="2022-04-15T06:19:14Z" level=warning msg="could not connect to Postgres database: dial tcp: i/o timeout" cluster-name=default/acid-minimal-cluster pkg=cluster worker=1
time="2022-04-15T06:19:29Z" level=warning msg="could not connect to Postgres database: dial tcp: i/o timeout" cluster-name=default/acid-minimal-cluster pkg=cluster worker=1
time="2022-04-15T06:19:44Z" level=warning msg="could not connect to Postgres database: dial tcp: i/o timeout" cluster-name=default/acid-minimal-cluster pkg=cluster worker=1
time="2022-04-15T06:19:59Z" level=warning msg="could not connect to Postgres database: dial tcp: i/o timeout" cluster-name=default/acid-minimal-cluster pkg=cluster worker=1
time="2022-04-15T06:20:14Z" level=warning msg="could not connect to Postgres database: dial tcp: i/o timeout" cluster-name=default/acid-minimal-cluster pkg=cluster worker=1
time="2022-04-15T06:20:29Z" level=warning msg="could not connect to Postgres database: dial tcp: i/o timeout" cluster-name=default/acid-minimal-cluster pkg=cluster worker=1
time="2022-04-15T06:20:44Z" level=warning msg="could not connect to Postgres database: dial tcp: i/o timeout" cluster-name=default/acid-minimal-cluster pkg=cluster worker=1
time="2022-04-15T06:20:59Z" level=warning msg="could not connect to Postgres database: dial tcp: i/o timeout" cluster-name=default/acid-minimal-cluster pkg=cluster worker=1
time="2022-04-15T06:20:59Z" level=warning msg="error while syncing cluster state: could not sync roles: could not init db connection: could not init db connection: still failing after 8 retries" cluster-name=default/acid-minimal-cluster pkg=cluster worker=1
time="2022-04-15T06:20:59Z" level=error msg="could not sync cluster: could not sync roles: could not init db connection: could not init db connection: still failing after 8 retries" cluster-name=default/acid-minimal-cluster pkg=controller worker=1
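To see at a glance which step fails and how often, the operator log can be summarized by level and message. The sketch below is an illustration only: `parse_logfmt` and `count_by_level` are hypothetical helper names, and it assumes the `key=value` / `key="quoted value"` line layout shown in the excerpt above. Lines in another layout (such as the `W0415 ...` warning) simply yield no `level` field.

```python
import shlex
from collections import Counter


def parse_logfmt(line: str) -> dict:
    """Split one key=value log line into a dict (hypothetical helper).

    shlex handles the double-quoted values and the escaped quotes
    (\\"pvc\\") that appear in the operator output.
    """
    try:
        tokens = shlex.split(line)
    except ValueError:
        # Unbalanced quotes: treat the line as unparseable.
        return {}
    fields = {}
    for token in tokens:
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


def count_by_level(lines) -> Counter:
    """Count log lines per level, skipping lines without a level field."""
    counts = Counter()
    for line in lines:
        level = parse_logfmt(line).get("level")
        if level:
            counts[level] += 1
    return counts
```

Applied to the excerpt above, this makes the pattern explicit: the sync proceeds through services, volumes, and statefulsets at debug/info level, and only the role sync step produces repeated warnings, one every 15 seconds, before the final error.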

When I connect to one of the Spilo pods with "kubectl exec", I can see that the role and database defined in the "minimal-postgres-manifest.yaml" file have not been created.

I deployed the Zalando postgres-operator and the PostgreSQL cluster following the quickstart procedure: https://github.com/zalando/postgres-operator/blob/master/docs/quickstart.md

I made just 3 changes to the provided "minimal-postgres-manifest.yaml" file: I changed the number of instances from 3 to 2, I decreased the volume size, and I declared a specific storage class:

apiVersion: "acid.zalan.do/v1"
kind: postgresql
metadata:
  name: acid-minimal-cluster
  namespace: default
spec:
  teamId: "acid"
  volume:
    size: 500Mi
    storageClass: pg-openebs-sc
  numberOfInstances: 2
  users:
    zalando:  # database owner
    - superuser
    - createdb
    foo_user: []  # role for application foo
  databases:
    foo: zalando  # dbname: owner
  preparedDatabases:
    bar: {}
  postgresql:
    version: "14"

My storage class is based on OpenEBS, but I also tried "kubernetes.io/no-provisioner" with the same result. If I check the folders used by the PV, I can see that some folders and files have been created by the PostgreSQL pods.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: pg-openebs-sc
  annotations:
    openebs.io/cas-type: local
    cas.openebs.io/config: |
      - name: StorageType
        value: hostpath
      - name: BasePath
        value: /var/lib/postgresql/data
provisioner: openebs.io/local
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer

My Spilo pods are running; one has the master role and the other has the replica role:

NAME                     READY   STATUS    RESTARTS   AGE   IP            NODE                                                  NOMINATED NODE   READINESS GATES   SPILO-ROLE
acid-minimal-cluster-0   1/1     Running   0          15h   10.244.2.13   ctms-prod-vm-prod01-1a-02.prod.outscale.easyconform                          master
acid-minimal-cluster-1   1/1     Running   0          15h   10.244.1.14   ctms-prod-vm-prod01-1a-01.prod.outscale.easyconform                          replica

The end of the master pod logs contains:

2022-04-15 07:05:51,306 INFO: no action. I am (acid-minimal-cluster-0) the leader with the lock
2022-04-15 07:06:01,306 INFO: no action. I am (acid-minimal-cluster-0) the leader with the lock
2022-04-15 07:06:11,307 INFO: no action. I am (acid-minimal-cluster-0) the leader with the lock

The end of the replica pod logs contains:

2022-04-15 07:07:11,341 INFO: no action. I am a secondary (acid-minimal-cluster-1) and following a leader (acid-minimal-cluster-0)
2022-04-15 07:07:21,338 INFO: no action. I am a secondary (acid-minimal-cluster-1) and following a leader (acid-minimal-cluster-0)
2022-04-15 07:07:31,320 INFO: no action. I am a secondary (acid-minimal-cluster-1) and following a leader (acid-minimal-cluster-0)

Do you have any ideas to resolve these points?
