Jasmine H

Reputation: 21

postgresql-ha "unable to connect to upstream node" randomly, causing pgpool "kind does not match between main(0) slot[0] (52)"

Environment: kubernetes with istio sidecars injected.

I'm using bitnami/postgresql-ha as the database for my Airflow deployment, and I randomly see the log below in my PostgreSQL StatefulSet with 3 pods (image: bitnami/postgresql-repmgr:15.3.0-debian-11-r8). Sometimes it appears 10+ times a day, sometimes only once a day; I can't find any pattern.

[2023-08-18 02:41:42] [WARNING] unable to ping "user=repmgr password=admin host=airflow-postgresql-1.airflow-postgresql-headless.workflow.svc.cluster.local dbname=repmgr port=5432 connect_timeout=5"
[2023-08-18 02:41:42] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2023-08-18 02:41:42] [WARNING] unable to connect to upstream node "airflow-postgresql-1" (ID: 1001)
[2023-08-18 02:41:42] [NOTICE] node "airflow-postgresql-1" (ID: 1001) has recovered, reconnecting
[2023-08-18 02:41:42] [NOTICE] reconnected to upstream node after 0 seconds

Note: it always reconnects after 0 seconds.
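To tell whether the upstream node is genuinely unreachable when repmgr's PQping fails, or whether this is a brief network/DNS blip, a small TCP probe can be left running in a debug pod. This is only a sketch I wrote for diagnosis (the function names are mine, not part of the chart or repmgr):

```python
import socket
import time
from datetime import datetime, timezone

def tcp_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False

def monitor(host: str, port: int, interval: float = 1.0, rounds: int = 5) -> list:
    """Probe host:port `rounds` times, logging each failure with a UTC timestamp."""
    results = []
    for _ in range(rounds):
        ok = tcp_reachable(host, port)
        if not ok:
            print(f"[{datetime.now(timezone.utc).isoformat()}] unreachable: {host}:{port}")
        results.append(ok)
        time.sleep(interval)
    return results
```

Running e.g. `monitor("airflow-postgresql-1.airflow-postgresql-headless.workflow.svc.cluster.local", 5432)` (the hostname from the log above) in a loop and correlating its failure timestamps with the repmgr warnings would show whether the pod itself or the path to it is flapping.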

This can cause the pgpool livenessProbe to fail with the event message below, which in turn causes Airflow tasks to fail.

Liveness probe failed: Checking pgpool health... 
psql: error: connection to server on socket "/opt/bitnami/pgpool/tmp/.s.PGSQL.5432" 
failed: ERROR: unable to read message kind DETAIL: kind does not match between main(0) slot[0] (52)

I've tried:

  1. Extended the livenessProbe periodSeconds and timeoutSeconds for pgpool; it doesn't help.
  2. Changed the pgpool replica count from 2 pods to 1; it doesn't help.
  3. Set pgHbaTrustAll to true in postgresql; it doesn't help.
  4. Changed the postgresql and pgpool image versions (tried pgpool 4.3 and 4.4, repmgr 14 and 15); it doesn't help.
  5. Deployed the same architecture on another k8s cluster; it still happens.
  6. Turned off pgpool load balancing; it doesn't help.
  7. Increased the max connection size to 10000; it doesn't help.

I've also checked that the CPU/memory resources of all related pods are sufficient.

Upvotes: 0

Views: 930

Answers (1)

nighthawk

Reputation: 1

I've been seeing the exact same thing. I tried some of your steps as well, and eventually found out that one of my worker nodes was having intermittent network connectivity issues. I got there because I noticed DNS queries failing randomly. That said, it could also be CoreDNS not having enough resources to cope with the cluster's demands; you could check that too.
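To check for the random DNS failures mentioned above, you can measure the lookup failure rate from inside the cluster. A minimal sketch (the helper names are mine; run it from a pod scheduled on each worker node in turn):

```python
import socket

def resolves(hostname: str) -> bool:
    """True if the hostname resolves via the pod's configured resolver (CoreDNS in-cluster)."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False

def dns_flake_rate(hostname: str, attempts: int = 20) -> float:
    """Fraction of lookups that failed; anything above 0.0 suggests flaky DNS."""
    failures = sum(1 for _ in range(attempts) if not resolves(hostname))
    return failures / attempts
```

Probing the headless-service name from the question (`airflow-postgresql-1.airflow-postgresql-headless.workflow.svc.cluster.local`): a nonzero rate only on one node points at that node's networking, while nonzero rates on every node point at CoreDNS capacity.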

Upvotes: 0
