staticFlow

Reputation: 149

Why do some of my Kubernetes nodes fail to connect to my Postgres cluster while others succeed?

I am running a k8s cluster with a 3-pod Postgres cluster fronted by a 3-pod PgBouncer cluster. Connecting to that is a batch job with multiple parallel workers which stream data into the database via PgBouncer. If I run 10 of these batch job pods, everything works smoothly. If I go up an order of magnitude to 100 job pods, a large portion of them fail to connect to the database with the error "got error driver: bad connection". Multiple workers run on the same node (5 worker pods per node), so it's only ~26 nodes in the k8s cluster.

What's maddening is that I'm not seeing any Postgres or PgBouncer error/warning logs in Kibana, and their pods aren't failing. Prometheus metrics also show the connection count to be well under max_connections.

Below are the postgres and pgbouncer configs along with the connection code of the workers.

Relevant Connection Code From Worker:

err = backoff.Retry(func() error {
    // Retry gorm.Open with exponential backoff until the connection
    // succeeds or the backoff policy gives up.
    p.connection, err = gorm.Open(postgres.New(postgres.Config{
        DSN: p.postgresUrl,
    }), &gorm.Config{Logger: newLogger})
    return err
}, backoff.NewExponentialBackOff())
if err != nil {
    log.Panic(err)
}
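
For reference, the workers do not currently cap the underlying connection pool, so each one can open as many connections as GORM's database/sql pool allows. A minimal sketch of bounding it once gorm.Open succeeds (assuming p.connection is a *gorm.DB; the specific limits below are placeholders, not values from my setup):

// Cap the underlying database/sql pool so each worker only holds a
// couple of connections instead of an unbounded number.
sqlDB, err := p.connection.DB()
if err != nil {
    log.Panic(err)
}
sqlDB.SetMaxOpenConns(2)                  // placeholder per-worker ceiling
sqlDB.SetMaxIdleConns(1)                  // keep one idle connection between writes
sqlDB.SetConnMaxLifetime(5 * time.Minute) // recycle connections periodically (needs the time package)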

Postgres Config:

postgresql:
  parameters:
    max_connections = 200
    shared_buffers = 4GB
    effective_cache_size = 12GB
    maintenance_work_mem = 1GB
    checkpoint_completion_target = 0.7
    wal_buffers = 16MB
    default_statistics_target = 100
    random_page_cost = 4
    effective_io_concurrency = 2
    work_mem = 6990kB
    min_wal_size = 1GB
    max_wal_size = 4GB
    max_worker_processes = 6
    max_parallel_workers_per_gather = 3
    max_parallel_workers = 6
    max_parallel_maintenance_workers = 3

PgBouncer Config:

[databases]
* = host=loot port=5432 auth_user=***

[pgbouncer]
listen_port = 5432
listen_addr = *
auth_type = md5
auth_file = /pgconf/users.txt
auth_query = SELECT username, password from pgbouncer.get_auth($1)
pidfile = /tmp/pgbouncer.pid
logfile = /dev/stdout
admin_users = ***
stats_users = ***
default_pool_size = 20
max_client_conn = 600
max_db_connections = 190
min_pool_size = 0
pool_mode = session
reserve_pool_size = 0
reserve_pool_timeout = 5
query_timeout = 0
ignore_startup_parameters = extra_float_digits

Screenshot: Prometheus stats for the Postgres DB

Things I've tried:

  1. Having the jobs connect directly to the cluster IP of the PgBouncer service to rule out DNS.

  2. Increasing the PgBouncer connection pool size.

I'm not sure what the issue is here, since I don't have any errors from the DB side to fix and only a basic error message from the job side. Any help would be appreciated, and I can add more context if a key piece is missing.

Upvotes: 0

Views: 532

Answers (1)

staticFlow

Reputation: 149

This ended up being an issue of Postgres not actually using the ConfigMap I had set. The ConfigMap specified 200 connections, but the running database was still at the default of 100.

Not much to learn here other than to make sure the configs you set actually propagate to the running service.
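
A quick way to verify this from the application side is to ask the running server for the effective value instead of trusting the ConfigMap. A minimal sketch reusing the GORM connection from the question (variable names are illustrative):

// SHOW max_connections returns the value the server is actually running
// with; after the ConfigMap fix this should report 200, not the default 100.
var maxConns string
if err := p.connection.Raw("SHOW max_connections").Scan(&maxConns).Error; err != nil {
    log.Panic(err)
}
log.Printf("effective max_connections = %s", maxConns)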

Upvotes: 0
