Reputation: 149
So I am running a k8s cluster with 3 pod postgres cluster fronted by a 3 pod pgbouncer cluster. Connecting to that is a batch job with multiple parallel workers which stream data into the database via pgbouncer. If I run 10 of these batch job pods everything works smoothly. If I go up an order of magnitude to 100 job pods, a large portion of them fail to connect to the database with the error got error driver: bad connection
. Multiple workers run on the same node (5 worker pods per node) So it's only ~26 pods in the k8s cluster.
What's maddening is I'm not seeing any postgres or pgbouncer error/warning logs in Kibana and their pods aren't failing. Also Prometheus logging shows it to be well under the max connections.
Below are the postgres and pgbouncer configs along with the connection code of the workers.
Relevant Connection Code From Worker:
err = backoff.Retry(func() error {
p.connection, err = gorm.Open(postgres.New(postgres.Config{
DSN: p.postgresUrl,
}), &gorm.Config{Logger: newLogger})
return err
}, backoff.NewExponentialBackOff())
if err != nil {
log.Panic(err)
}
Postgres Config:
postgresql:
parameters:
max_connections = 200
shared_buffers = 4GB
effective_cache_size = 12GB
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.7
wal_buffers = 16MB
default_statistics_target = 100
random_page_cost = 4
effective_io_concurrency = 2
work_mem = 6990kB
min_wal_size = 1GB
max_wal_size = 4GB
max_worker_processes = 6
max_parallel_workers_per_gather = 3
max_parallel_workers = 6
max_parallel_maintenance_workers = 3
PgBouncer Config:
[databases]
* = host=loot port=5432 auth_user=***
[pgbouncer]
listen_port = 5432
listen_addr = *
auth_type = md5
auth_file = /pgconf/users.txt
auth_query = SELECT username, password from pgbouncer.get_auth($1)
pidfile = /tmp/pgbouncer.pid
logfile = /dev/stdout
admin_users = ***
stats_users = ***
default_pool_size = 20
max_client_conn = 600
max_db_connections = 190
min_pool_size = 0
pool_mode = session
reserve_pool_size = 0
reserve_pool_timeout = 5
query_timeout = 0
ignore_startup_parameters = extra_float_digits
Screenshot of Postgres DB Stats
Things I've tried:
Having the jobs connect directly to the cluster IP of the Pgbouncer service to rule out DNS.
Increasing PgBouncer connection pool
I'm not sure what the issue is here since I don't have any errors from the DB side to fix and a basic error message from the job side. Any help would be appreciated and I can add more context if a key piece is missing.
Upvotes: 0
Views: 532
Reputation: 149
This ended up being an issue of postgres not actually using the configmap I had set. The map was for 200 connections but the actual DB was still at the default of 100.
Not much to learn here other than make sure to check that the configs you set actually propagate to the actual service.
Upvotes: 0