Reputation: 492
I have an application deployed to kubernetes. Here is techstack: Java 11, Spring Boot 2.3.x or 2.5.x, using hikari 3.x or 4.x
Using spring actuator to do healthcheck. Here is liveness
and readiness
configuration within application.yaml:
endpoint:
  health:
    group:
      liveness:
        include: '*'
        exclude:
          - db
          - readinessState
      readiness:
        include: '*'
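For context, in a complete application.yaml this fragment normally sits under the management: prefix, with the probe endpoints enabled and exposed over HTTP (a sketch; the probes.enabled and exposure settings are assumptions, not shown in the original):

```yaml
management:
  endpoint:
    health:
      probes:
        enabled: true        # assumption: exposes /actuator/health/{liveness,readiness}
      group:
        liveness:
          include: '*'
          exclude:
            - db
            - readinessState
        readiness:
          include: '*'
  endpoints:
    web:
      exposure:
        include: health      # assumption: only the health endpoint is exposed
```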
What this does if the DB is down: liveness is not impacted, meaning the application container keeps running even during a DB outage. Readiness is impacted, ensuring no traffic is routed to the container.
Liveness and readiness configuration in the container spec:
livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8443
    scheme: HTTPS
  initialDelaySeconds: 30
  periodSeconds: 30
  timeoutSeconds: 5
readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8443
    scheme: HTTPS
  initialDelaySeconds: 30
  periodSeconds: 30
  timeoutSeconds: 20
My application started and ran fine for a few hours.
What I did:
I brought the DB down.
Issue noticed:
When the DB is down, after 90+ seconds I see 3 more pods being spun up. When a pod is described, I see status and conditions like below:
Status: Running
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
When I list all running pods:
NAME READY STATUS RESTARTS AGE
application-a-dev-deployment-success-5d86b4bcf4-7lsqx 0/1 Running 0 6h48m
application-a-dev-deployment-success-5d86b4bcf4-cmwd7 0/1 Running 0 49m
application-a-dev-deployment-success-5d86b4bcf4-flf7r 0/1 Running 0 48m
application-a-dev-deployment-success-5d86b4bcf4-m5nk7 0/1 Running 0 6h48m
application-a-dev-deployment-success-5d86b4bcf4-tx4rl 0/1 Running 0 49m
My Analysis/Findings:
Per the readinessProbe configuration, periodSeconds is set to 30 seconds and failureThreshold defaults to 3 per the k8s documentation.
Per application.yaml, readiness includes the db check, meaning the readiness check fails every 30 seconds. When it fails 3 times, failureThreshold is met and new pods are spun up.
The restart policy defaults to Always.
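The timing in this analysis can be checked with quick arithmetic: with the default failureThreshold of 3 and a periodSeconds of 30, a pod is marked NotReady roughly 90 seconds after the DB goes down, which matches the "90+ seconds" observed above:

```shell
# failureThreshold (default 3) x periodSeconds (30s from the probe spec)
echo $((3 * 30))   # prints 90 -- seconds until the pod is marked NotReady
```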
Questions:
1. Why are new pods spun up when the readiness check fails?
2. What is the role of restartPolicy here?
Upvotes: 2
Views: 3370
Reputation: 1
A readiness probe doesn't restart the pod; it just marks its Ready state as false.
failureThreshold: When a probe fails, Kubernetes will try failureThreshold times before giving up. Giving up in the case of a liveness probe means restarting the container; in the case of a readiness probe, the Pod is marked Unready. Defaults to 3. Minimum value is 1.
That said, you should also consider the liveness probe, which actually restarts the pod in a similar situation. https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#:~:text=Play%20with%20Kubernetes-,Define%20a%20liveness%20command,-Many%20applications%20running
Upvotes: 0
Reputation: 492
The crux lay in the HPA. CPU utilization of the pods jumped after the readiness failure, and as it went above 70%, the HPA was triggered and started those 3 pods.
Upvotes: 0
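For anyone hitting the same thing, an HPA with a 70% CPU target looks roughly like this (a sketch; the name and replica bounds are assumptions, not taken from the original deployment):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: application-a-dev-hpa          # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: application-a-dev-deployment-success
  minReplicas: 2                       # assumption
  maxReplicas: 5                       # assumption: 2 + 3 extra pods fits the observation
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70       # the 70% threshold mentioned above
```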
Reputation: 841
When a probe fails, Kubernetes retries up to failureThreshold times before giving up. You can change your restartPolicy to OnFailure, which restarts the container only if it fails, or Never if you don't want the container to be restarted. The difference between the statuses you can find here. Note this: The restartPolicy applies to all containers in the Pod. restartPolicy only refers to restarts of the containers by the kubelet on the same node. After containers in a Pod exit, the kubelet restarts them with an exponential back-off delay (10s, 20s, 40s, …), that is capped at five minutes. Once a container has executed for 10 minutes without any problems, the kubelet resets the restart backoff timer for that container.
Share your full Deployment file; I suppose that you've set the replicas number to 3.
This is answered in the answer to the 1st question.
Also note this, in case it helps:
Startup probes are useful for Pods that have containers that take a long time to come into service. Rather than set a long liveness interval, you can configure a separate configuration for probing the container as it starts up, allowing a time longer than the liveness interval would allow.
If your container usually starts in more than initialDelaySeconds + failureThreshold × periodSeconds, you should specify a startup probe that checks the same endpoint as the liveness probe. The default for periodSeconds is 10s. You should then set its failureThreshold high enough to allow the container to start, without changing the default values of the liveness probe. This helps to protect against deadlocks.
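Applied to the probes in the question, a startup probe might look like this (a sketch reusing the question's liveness endpoint; the failureThreshold of 10 is an assumption allowing up to 10 x 10s = 100s for startup):

```yaml
startupProbe:
  httpGet:
    path: /actuator/health/liveness   # same endpoint as the liveness probe
    port: 8443
    scheme: HTTPS
  periodSeconds: 10
  failureThreshold: 10                # assumption: allows up to ~100s to start
```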
Upvotes: 1