Erik Mulder

Reputation: 93

Cloud SQL Proxy: occasional connection timeouts under load

Problem

Under high load, our Cloud SQL Proxy occasionally logs this error:

2020/06/05 13:35:47 couldn't connect to "my-cloudsql-instance": dial tcp xx.xx.xx.xx:3307: connect: connection timed out

Context

We run a Kubernetes cluster with an Airflow pod that starts a lot of tasks in parallel using the LocalExecutor. Each new task connects to the Airflow metadata database (which runs in Cloud SQL) through the Cloud SQL Proxy (a sidecar of the Airflow pod). Every once in a while the error above occurs, which causes the task to fail in Airflow.

What I've tested and found out so far:

[2020-06-04 11:11:13,839] {taskinstance.py:1128} ERROR -
    (psycopg2.OperationalError) server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.

(Background on this error at: http://sqlalche.me/e/e3q8)

Upvotes: 0

Views: 1699

Answers (1)

Erik Mulder

Reputation: 93

This issue is probably caused by our custom network infrastructure setup and not by any Google tools or services. We did find an interesting solution though: add another proxy! Why? It turns out Airflow does no proper connection pooling, so connections were constantly being opened and closed, each time paying the SSL handshake and authentication/authorization overhead that comes with using the Cloud SQL Proxy container. Hence the high load and the occasional dropped connections.

We added a PgBouncer container to the pod Airflow runs in and used the connection pooling it implements. All the connection opening and closing now happens over the local network inside the pod, without SSL or complicated authentication, so it is very fast. No more high load, no more dropped connections!
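
For anyone wanting to try the same setup, here is a rough sketch of the wiring, not our exact configuration. It renders a minimal pgbouncer.ini that forwards to the Cloud SQL Proxy sidecar, and builds the connection URL Airflow would use to reach PgBouncer instead of the proxy. Everything specific here is an assumption for illustration: the function names, the ports (5432 for the proxy, 6432 for PgBouncer), the database name, the auth_file path, md5 auth, transaction pool mode and the pool sizes. Adjust them to your own pod.

```python
import configparser
import io
from urllib.parse import quote_plus


def render_pgbouncer_ini(db_name, proxy_host="127.0.0.1", proxy_port=5432,
                         listen_port=6432, pool_size=20):
    """Return the text of a minimal pgbouncer.ini (all values are illustrative)."""
    config = configparser.ConfigParser()
    # Keep option names exactly as written instead of lower-casing them.
    config.optionxform = str
    # PgBouncer forwards its pooled connections to the Cloud SQL Proxy sidecar.
    config["databases"] = {
        db_name: f"host={proxy_host} port={proxy_port} dbname={db_name}",
    }
    config["pgbouncer"] = {
        # Only listen inside the pod; Airflow connects over localhost.
        "listen_addr": "127.0.0.1",
        "listen_port": str(listen_port),
        "auth_type": "md5",                          # assumption
        "auth_file": "/etc/pgbouncer/userlist.txt",  # hypothetical path
        "pool_mode": "transaction",                  # assumption
        "max_client_conn": "1000",                   # assumption
        "default_pool_size": str(pool_size),
    }
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


def airflow_conn_url(user, password, db_name, port=6432):
    """Build a SQLAlchemy URL that points Airflow at PgBouncer, not the proxy."""
    # Escape the credentials so special characters do not break the URL.
    return (f"postgresql+psycopg2://{quote_plus(user)}:{quote_plus(password)}"
            f"@127.0.0.1:{port}/{db_name}")


if __name__ == "__main__":
    print(render_pgbouncer_ini("airflow"))
    print(airflow_conn_url("airflow", "secret", "airflow"))
```

The resulting URL goes into Airflow's sql_alchemy_conn setting (the AIRFLOW__CORE__SQL_ALCHEMY_CONN environment variable on the Airflow versions current at the time of writing; check the name for your version). Airflow tasks then open and close cheap local connections to PgBouncer, while PgBouncer keeps a small set of long-lived connections open through the Cloud SQL Proxy.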

Upvotes: 1
