Configuring Spring Boot to Behave Well Under High Load with Kubernetes Liveness Probes

Question

I am running a Spring Boot deployment (a typical microservice web server setup with a gateway, a separate authentication server, etc., fronted by an nginx deployment acting as reverse proxy and load balancer). We orchestrate Docker containers with Kubernetes. We are preparing for production and have recently started load testing, which has revealed some issues in how the system handles these loads.

My issue is that when the server is subjected to high load (here, performance testing with Gatling), the liveness probes return 503 errors, which triggers a restart by Kubernetes.

Naturally, the liveness probe is important, but when the system starts dropping requests, the last thing we should do is to kill pods, which causes cascading failures by shifting load to the remaining pods.

This specific problem with the Spring Actuator health check is described in this SO question, which offers some hints, but the answers are not thorough. Specifically, the idea of using a liveness command (e.g. to check if the Java process is running) seems inadequate to me, since it would miss actual downtime if the Java process is running but there is some exception or some missing resource (database, Kafka, ...).

Is there a good guide for configuring production Spring on Kubernetes/Cloud deployments?
How do I deal with the specific issue of the liveness probe failing under high load? Does anyone have experience with this?

Will R.O.F. · Accepted Answer

Note: This is the answer provided by @AndyWilkinson and @ChinHuang in the comments, which @AlexandreCassagne confirmed solved the issue:

If a liveness probe indicates that the current level of traffic is overwhelming your app such that it cannot handle requests, trying to find a way to suppress that seems counter-productive to me. Do you have a readiness probe configured? When your app becomes overwhelmed, you probably want it to indicate that it is unable to handle traffic for a while. Once the load has dropped and it's recovered, it can then start handling traffic again without the need for a restart.
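
As a rough sketch of what such a setup could look like in the pod spec, assuming Spring Boot 2.3 or later with the Actuator probe endpoints exposed on the main server port. The container name, image, port and every timing value below are illustrative placeholders, not values from the question, and would need tuning against your own load tests:

```yaml
# Fragment of a Deployment pod template (spec.template.spec).
# All names and numbers are assumptions made for illustration.
containers:
  - name: my-service                 # hypothetical container name
    image: example/my-service:1.0    # hypothetical image
    ports:
      - containerPort: 8080
    readinessProbe:
      httpGet:
        path: /actuator/health/readiness
        port: 8080
      periodSeconds: 5
      timeoutSeconds: 2
      failureThreshold: 2            # stop routing traffic after about 10 s of failures
    livenessProbe:
      httpGet:
        path: /actuator/health/liveness
        port: 8080
      periodSeconds: 10
      timeoutSeconds: 5
      failureThreshold: 6            # restart only after about 60 s of consecutive failures
```

With thresholds along these lines, an overloaded pod fails readiness first and is removed from the Service endpoints, while the liveness probe needs many more consecutive failures before Kubernetes restarts the container.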

Also, a liveness probe should only care about a missing resource (database, Kafka, etc) if that resource is only used by a single instance. If multiple instances all access the resource and it goes down, all of the liveness probes will fail. This will cause cascading failures and restarts across your deployment. There's some guidance on this in the Spring Boot 2.3 reference documentation.

Spring Boot 2.3 introduces separate liveness and readiness probes.
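
On the application side, a minimal configuration sketch for the guidance above, assuming Spring Boot 2.3 or later with spring-boot-starter-actuator on the classpath. The group contents shown match the defaults and are spelled out here only to make the intent explicit:

```yaml
# application.yaml (sketch, not a complete configuration)
management:
  endpoint:
    health:
      probes:
        enabled: true               # exposes /actuator/health/liveness and /actuator/health/readiness
      group:
        liveness:
          include: livenessState    # internal application state only; no database, Kafka, etc.
        readiness:
          include: readinessState   # add further indicators here only after weighing the shared-resource caveat
```

Keeping external dependencies out of the liveness group means that an outage of a shared resource does not make every instance fail its liveness probe at once.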
