Raghav Patel

Reputation: 71

Kubernetes pod's liveness probe failing without any error message

I'm running pods in EKS, and each pod has 3 containers. One of the containers is restarting every 5 minutes with the message "Liveness probe failed:". The liveness probe event contains no error message explaining why the probe failed.

Here are the events from the pod describe output:

2023-02-07T14:43:00Z   2023-02-07T14:43:00Z   1       default-scheduler   Normal    Scheduled   Successfully assigned <my pod name>/<my pod name>-8ffcd5c5c-5qt7v to ip-10-21-165-115.ap-south-1.compute.internal
2023-02-07T14:43:02Z   2023-02-07T14:43:02Z   1       kubelet             Normal    Pulled      Container image "<my docker repository>/proxyv2:1.12.8-034f0f9b2e-distroless" already present on machine
2023-02-07T14:43:02Z   2023-02-07T14:43:02Z   1       kubelet             Normal    Created     Created container istio-init
2023-02-07T14:43:02Z   2023-02-07T14:43:02Z   1       kubelet             Normal    Started     Started container istio-init
2023-02-07T14:43:03Z   2023-02-07T14:48:06Z   2       kubelet             Normal    Pulled      Container image "<my docker repository >/<my pod name>:1.74.3-SNAPSHOT" already present on machine
2023-02-07T14:43:03Z   2023-02-07T14:48:06Z   2       kubelet             Normal    Created     Created container <my pod name>
2023-02-07T14:43:03Z   2023-02-07T14:43:03Z   1       kubelet             Normal    Started     Started container <my pod name>
2023-02-07T14:43:03Z   2023-02-07T14:43:03Z   1       kubelet             Normal    Pulled      Container image "<my docker repository >/proxyv2:1.12.8-034f0f9b2e-distroless" already present on machine
2023-02-07T14:43:03Z   2023-02-07T14:43:03Z   1       kubelet             Normal    Created     Created container istio-proxy
2023-02-07T14:43:03Z   2023-02-07T14:43:03Z   1       kubelet             Normal    Started     Started container istio-proxy
2023-02-07T14:43:04Z   2023-02-07T14:43:06Z   5       kubelet             Warning   Unhealthy   Readiness probe failed: Get "http://10.21.169.218:15021/healthz/ready": dial tcp 10.21.169.218:15021: connect: connection refused
2023-02-07T14:47:31Z   2023-02-07T14:58:02Z   18      kubelet             Warning   Unhealthy   Readiness probe failed:
2023-02-07T14:47:41Z   2023-02-07T14:48:01Z   3       kubelet             Warning   Unhealthy   Liveness probe failed:
2023-02-07T14:48:01Z   2023-02-07T14:48:01Z   1       kubelet             Normal    Killing     Container <my pod name> failed liveness probe, will be restarted

Here is my Dockerfile

FROM openjdk:8-jdk-alpine

ARG JAR_FILE
ARG SERVICE_PORT
ENV JMX_VERSION=0.12.0
ENV GRPC_HEALTH_PROBE_VERSION=v0.4.5
ENV GRPCURL_VERSION=1.8.7

# Install and configure JMX exporter
RUN mkdir -p /opt/jmx
COPY ./devops/jmx-config.yaml /opt/jmx/config.yaml
RUN wget https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/${JMX_VERSION}/jmx_prometheus_javaagent-${JMX_VERSION}.jar -O /opt/jmx/jmx.jar

# Install grpc_health_probe binary
RUN wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-amd64 && \
    chmod +x /bin/grpc_health_probe

# Install grpcurl binary
RUN wget -P /tmp/ https://github.com/fullstorydev/grpcurl/releases/download/v${GRPCURL_VERSION}/grpcurl_${GRPCURL_VERSION}_linux_x86_64.tar.gz \
    && tar -xvf /tmp/grpcurl* -C /bin/ \
    && chmod +x /bin/grpcurl \
    && rm -rf /tmp/grpcurl*

# Install jq
RUN apk add jq

# Copy the .proto file
RUN mkdir -p /lib-grpc-actuator/src/main/proto
COPY ./lib-grpc-actuator/src/main/proto/grpc_health.proto /lib-grpc-actuator/src/main/proto

# Copy the health check shell script
COPY grpcurl_health.sh /opt/
RUN chmod +x /opt/grpcurl_health.sh

# Expose grpc metric port, jmx exporter port
EXPOSE 9101 9110

COPY ${JAR_FILE} /app.jar

# Expose service port
EXPOSE ${SERVICE_PORT}

CMD java -Dlog4j.configuration=file:/opt/log4j-properties/log4j.properties -XX:+UseG1GC $JAVA_OPTS -javaagent:/opt/jmx/jmx.jar=9101:/opt/jmx/config.yaml -jar -Dconfig-file=/opt/config-properties/config.properties /app.jar

Here is the shell script I'm using for Liveness and Readiness Probes

#!/bin/sh

#define service grpc port
service_prot=$1

#grpc_health_actuators grpcurl command
response=`/bin/grpcurl \
    -plaintext \
    -import-path /lib-grpc-actuator/src/main/proto/ \
    -proto grpc_health.proto \
    :$service_prot \
    com.<org name>.grpc.generated.grpc_health.HealthCheckService/health`

#grep the status from response
status=`echo $response | jq -r .status`

#echo response
echo $response

# based on the status value, return the probe exit code
if [ "$status" == "UP" ]
then
    echo "service is healthy : $response"
    exit 0
else
    echo "service is down : $response"
    exit 1
fi

Here is my kubernetes deployment YAML

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "15"
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"apps/v1","kind":"Deployment","metadata":{"annotations":{"kubernetes.io/change-cause":"kubectl apply --kubeconfig=config --filename=manifests.yaml --record=true","traffic.sidecar.istio.io/excludeOutboundIPRanges":"*"},"name":"<my pod name>","namespace":"<my pod name>"},"spec":{"replicas":1,"selector":{"matchLabels":{"app":"<my pod name>","harness.io/track":"stable"}},"strategy":{"rollingUpdate":{"maxSurge":1,"maxUnavailable":0},"type":"RollingUpdate"},"template":{"metadata":{"labels":{"app":"<my pod name>","harness.io/release-name":"release-89ef3582-d056-337f-8df0-97a3e7327caa","harness.io/track":"stable","version":"1.74.3-SNAPSHOT"}},"spec":{"containers":[{"env":[{"name":"JAVA_OPTS","value":"-Xms500m -Xmx900m"}],"image":"<my docker registry>/<my pod name>:1.74.3-SNAPSHOT","livenessProbe":{"exec":{"command":["/bin/sh","/opt/grpcurl_health.sh","50045"]},"initialDelaySeconds":20},"name":"<my pod name>","ports":[{"containerPort":50045,"name":"grpc","protocol":"TCP"},{"containerPort":9110,"name":"http-metrics","protocol":"TCP"},{"containerPort":9101,"name":"jmx-metrics","protocol":"TCP"}],"readinessProbe":{"exec":{"command":["/bin/sh","/opt/grpcurl_health.sh","50045"]},"initialDelaySeconds":10},"resources":{"limits":{"cpu":"2","memory":"2Gi"},"requests":{"cpu":"1","memory":"1Gi"}},"volumeMounts":[{"mountPath":"/opt/config-properties","name":"config-properties"},{"mountPath":"/opt/log4j-properties","name":"log4j-properties"}]}],"imagePullSecrets":[{"name":"<my pod name>-dockercfg"}],"serviceAccountName":"backend-services","volumes":[{"configMap":{"name":"config-properties-9"},"name":"config-properties"},{"configMap":{"name":"log4j-properties-9"},"name":"log4j-properties"}]}}}}
    kubernetes.io/change-cause: kubectl apply --kubeconfig=config --filename=manifests.yaml
      --record=true
    traffic.sidecar.istio.io/excludeOutboundIPRanges: '*'
  creationTimestamp: "2023-01-11T19:23:33Z"
  generation: 42
  name: <my pod name>
  namespace: <my pod name>
  resourceVersion: "305338514"
  uid: 4053e956-e28e-4c35-9b84-b50df2a1b8ff
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: <my pod name>
      harness.io/track: stable
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: <my pod name>
        harness.io/release-name: release-89ef3582-d056-337f-8df0-97a3e7327caa
        harness.io/track: stable
        version: 1.74.3-SNAPSHOT
    spec:
      containers:
      - env:
        - name: JAVA_OPTS
          value: -Xms500m -Xmx900m
        image: <my docker registry>/<my pod name>:1.74.3-SNAPSHOT
        imagePullPolicy: IfNotPresent
        livenessProbe:
          exec:
            command:
            - /bin/sh
            - /opt/grpcurl_health.sh
            - "50045"
          failureThreshold: 3
          initialDelaySeconds: 20
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        name: <my pod name>
        ports:
        - containerPort: 50045
          name: grpc
          protocol: TCP
        - containerPort: 9110
          name: http-metrics
          protocol: TCP
        - containerPort: 9101
          name: jmx-metrics
          protocol: TCP
        readinessProbe:
          exec:
            command:
            - /bin/sh
            - /opt/grpcurl_health.sh
            - "50045"
          failureThreshold: 3
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            cpu: "2"
            memory: 2Gi
          requests:
            cpu: "1"
            memory: 1Gi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /opt/config-properties
          name: config-properties
        - mountPath: /opt/log4j-properties
          name: log4j-properties
        - mountPath: /opt/script-logs
          name: debug
      dnsPolicy: ClusterFirst
      imagePullSecrets:
      - name: <my pod name>-dockercfg
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: backend-services
      serviceAccountName: backend-services
      terminationGracePeriodSeconds: 30
      volumes:
      - configMap:
          defaultMode: 420
          name: config-properties-9
        name: config-properties
      - configMap:
          defaultMode: 420
          name: log4j-properties-9
        name: log4j-properties
      - hostPath:
          path: /tmp/
          type: ""
        name: debug

Please help me figure out this issue.

Instead of the shell script, I also tried putting the whole command directly in the liveness and readiness probes as shown below, but I get the same result.

sh -c "if [ $(/bin/grpcurl -plaintext -import-path /lib-grpc-actuator/src/main/proto/ -proto grpc_health.proto :50045 com.<my org name>.grpc.generated.grpc_health.HealthCheckService/health | jq -r .status) == 'UP' ]; then exit 0; else echo $(/bin/grpcurl -plaintext -import-path /lib-grpc-actuator/src/main/proto/ -proto grpc_health.proto :50045 com.<my org name>.grpc.generated.grpc_health.HealthCheckService/health) && exit 1; fi"

Upvotes: 0

Views: 2240

Answers (2)

GorginZ

Reputation: 131

It looks like it's your Istio sidecar container that's failing the probes.

2023-02-07T14:43:04Z   2023-02-07T14:43:06Z   5       kubelet             Warning   Unhealthy   Readiness probe failed: Get "http://10.21.169.218:15021/healthz/ready": dial tcp 10.21.169.218:15021: connect: connection refused
2023-02-07T14:47:31Z   2023-02-07T14:58:02Z   18      kubelet             Warning   Unhealthy   Readiness probe failed:
2023-02-07T14:47:41Z   2023-02-07T14:48:01Z   3       kubelet             Warning   Unhealthy   Liveness probe failed:

The readiness probe shows a connection refused error on port 15021, which is the Istio health check port. Perhaps check through these Istio deployment requirements.

The deployment manifest only shows one of your containers. Could you share your Istio container configuration? A couple of commands for pulling that out of the running pod are sketched below.
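For example (a sketch; the pod and namespace names are placeholders taken from the events above):

# Show the istio-proxy container spec, including its probes, from the running pod
kubectl get pod <my pod name>-8ffcd5c5c-5qt7v -n <namespace> -o yaml | grep -A 20 'name: istio-proxy'

# Check the sidecar's own logs around the time of the probe failures
kubectl logs <my pod name>-8ffcd5c5c-5qt7v -n <namespace> -c istio-proxy --since=30m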

Upvotes: 0

DazWilkin

Reputation: 40061

Some things:

1. Kubernetes 1.24+ includes a gRPC probe

You could:

livenessProbe:
  grpc:
    port: 50045

This works for me.

2. Your container image includes grpc_health_probe but you're not using it. It would probably be the second choice after the native probe (above); using gRPCurl is making your life more complex.
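For example, a sketch of an exec probe that uses grpc_health_probe, assuming the /bin/grpc_health_probe path from your Dockerfile and the gRPC port from your manifest:

livenessProbe:
  exec:
    command:
    - /bin/grpc_health_probe
    - -addr=:50045
  initialDelaySeconds: 20

Note that grpc_health_probe, like the native probe above, expects the server to implement the standard grpc.health.v1.Health service.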

This works for me.

3. (I've simplified your script here.) You should consider using e.g. /bin/sh -c or /bin/bash -c and then providing your script invocation as a string:
xxxxxxxxProbe:
  exec:
    command:
    - /bin/sh
    - -c
    - "./grpcurl_health.sh :50045"

It's unclear from your question, but I think you're using a non-standard variant of the gRPC Health Checking Protocol. I'm using the standard version in my repro of your issue, and it returns e.g. SERVING as the status value:

#!/usr/bin/env bash

ENDPOINT=${1}

STATUS=$(\
  grpcurl -plaintext  ${ENDPOINT} grpc.health.v1.Health/Check \
  | jq -r .status)

if [ "${STATUS}" == "SERVING" ]
then
    echo "Service is healthy"
    exit 0
else
    echo "service is unhealthy"
    exit 1
fi

I'm using ENDPOINT rather than PORT for convenience too.

This works for me.

4. A useful debugging mechanism in this case is to kubectl exec into your container and run your commands manually:

# Example 1 (the native gRPC probe): not testable this way, since the kubelet performs it

# Example 2
kubectl exec \
--stdin --tty \
deployment/${DEPLOYMENT} \
--namespace=${NAMESPACE} \
--container=${CONTAINER} \
-- grpc_health_probe -addr=:50051
status: SERVING

# Example 3
kubectl exec \
--stdin --tty \
deployment/${DEPLOYMENT} \
--namespace=${NAMESPACE} \
--container=${CONTAINER} \
-- ./grpcurl_health.sh ":50051" && echo ${?}
Service is healthy
0

Upvotes: 0
