Reputation: 7764
I have my Kubernetes cluster set up on AWS, where I am trying to monitor several pods using cAdvisor + Prometheus + Alertmanager. What I want to do is trigger an email alert (with the service/container name) if a container/pod goes down or gets stuck in an Error or CrashLoopBackOff state, or in any state other than Running.
Upvotes: 12
Views: 16651
Reputation: 1128
I'm using this one:
- alert: PodCrashLooping
  annotations:
    description: Pod {{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) is restarting {{ printf "%.2f" $value }} times / 5 minutes.
    summary: Pod is crash looping.
  expr: rate(kube_pod_container_status_restarts_total{job="kube-state-metrics",namespace=~".*"}[5m]) * 60 * 5 > 0
  for: 5m
  labels:
    severity: critical
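The rule above only covers restart loops. To also catch pods stuck in a phase other than Running (Pending, Failed, Unknown), a companion rule along these lines should work, assuming kube-state-metrics in your cluster also exposes kube_pod_status_phase (the rule name and thresholds here are my own suggestion, not from the original setup):

# Hypothetical companion rule: fires when a pod has sat in a
# non-Running, non-Succeeded phase for more than 5 minutes.
- alert: PodNotRunning
  annotations:
    description: Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in phase {{ $labels.phase }} for more than 5 minutes.
    summary: Pod is stuck in a non-running phase.
  expr: sum by (namespace, pod, phase) (kube_pod_status_phase{phase=~"Pending|Failed|Unknown"}) > 0
  for: 5m
  labels:
    severity: critical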
Upvotes: 0
Reputation: 8983
Prometheus collects a wide range of metrics. As an example, you can use the metric kube_pod_container_status_restarts_total
to monitor restarts, which reflects your problem.
It carries labels which you can use in the alert:
container
namespace
pod
So all you need to do is configure your alertmanager.yaml
with the correct SMTP settings and a receiver, and add an alerting rule on the Prometheus side, like this:
global:
  # The smarthost and SMTP sender used for mail notifications.
  smtp_smarthost: 'localhost:25'
  smtp_from: '[email protected]'
  smtp_auth_username: 'alertmanager'
  smtp_auth_password: 'password'

receivers:
- name: 'team-X-mails'
  email_configs:
  - to: '[email protected]'

# Only one default receiver
route:
  receiver: team-X-mails
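One thing worth spelling out: the rule group below does not go into alertmanager.yaml but into a separate rules file loaded by Prometheus. A minimal sketch of the Prometheus side, assuming the rules are saved as alert-rules.yml and Alertmanager listens on localhost:9093 (both names are placeholders for your deployment):

# prometheus.yml (sketch): load the rules file and tell
# Prometheus where to send fired alerts.
rule_files:
  - alert-rules.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']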
# Example group with one alert
groups:
- name: example-alert
  rules:
  # Alert about restarts
  - alert: RestartAlerts
    expr: sum(kube_pod_container_status_restarts_total) by (namespace, pod, container) > 5
    for: 10m
    annotations:
      summary: "More than 5 restarts in pod {{ $labels.pod }}"
      description: "{{ $labels.container }} restarted {{ $value }} times in pod {{ $labels.namespace }}/{{ $labels.pod }}"
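Both files can be sanity-checked before a reload: promtool check rules validates the Prometheus rules file, and amtool check-config validates the Alertmanager configuration.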
Upvotes: 15