Reputation: 323
Prometheus 2.39.0
Alertmanager 0.24.0
Telegraf 1.20.2
I'm trying to set up an alert that checks whether MySQL and Nginx are running on a remote host. I've set up two Prometheus jobs.
For MySQL:
- job_name: "gm_mysql_pid"
  scheme: "https"
  tls_config:
    insecure_skip_verify: true
  static_configs:
    - targets:
        [
          "host1.com:9273",
        ]
And for Nginx:
- job_name: "gm_telegraf_exporter"
  scheme: "https"
  tls_config:
    insecure_skip_verify: true
  static_configs:
    - targets:
        [
          "host1.com:9273",
        ]
Alertmanager configuration
route:
  group_by: ["alertname", "group", "instance"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 120h
  receiver: devops-team
  routes:
    - match:
        group: gm_telegraf_exporter
      continue: true
      receiver: prometheus-receiver
receivers:
  - name: "prometheus-receiver"
    slack_configs:
      - api_url: https://hooks.slack.com/xxx
        channel: "alerts-prometheus"
        send_resolved: true
        title: '{{ template "custom_title" . }}'
        text: '{{ template "custom_slack_message" . }}'
Alert rules configuration
MySQL
- alert: gm_mysql_pid
  expr: procstat_lookup_pid_count{job="gm_mysql_pid",pid_finder="",pidfile="/var/run/mysqld/mysqld.pid"} <= 0
  labels:
    group: "gm_mysql_pid"
  annotations:
    identifier: "Host: {{$labels.host}}"
    description: "Trigger: MySQL service is down!"
Nginx
- alert: gm_nginx_pid
  expr: procstat_lookup_pid_count{job="gm_telegraf_exporter",pid_finder="",pidfile="/var/run/nginx.pid"} <= 0
  labels:
    group: "gm_telegraf_exporter"
  annotations:
    identifier: "Host: {{$labels.host}}"
    description: "Trigger: Nginx service is down!"
Looking into Telegraf metrics I see both metrics:
procstat_lookup_pid_count{host="host1.com",pid_finder="pgrep",pidfile="/var/run/mysqld/mysqld.pid",result="success"} 1
procstat_lookup_pid_count{host="host1.com",pid_finder="pgrep",pidfile="/var/run/nginx.pid",result="success"} 1
The problem is that when I stop both services (MySQL and Nginx), no alert is triggered in Alertmanager, even though the metrics show that both services are down:
procstat_lookup_pid_count{host="host1.com",pid_finder="pgrep",pidfile="",result="lookup_error"} 0
Am I missing something?
Upvotes: -1
Views: 32
Reputation: 323
I found that the issue was caused by the Telegraf version running on the server. The expression works with version 1.8 but not with version 1.25. Downgrading to version 1.8 solved the issue.
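If downgrading is not an option, a rule that does not depend on the exact label set may be worth trying. In PromQL, an equality matcher with an empty value such as pid_finder="" only selects series where that label is missing or empty, so it cannot select the series shown above with pid_finder="pgrep". The down-state sample also carries pidfile="", so the pidfile matcher stops matching once the process is gone. The sketch below is untested and rests on one assumption: that the series labelled with the real pidfile stops being exported after the lookup fails. The group name and the "for" duration are hypothetical.

```yaml
groups:
  - name: gm_process_alerts        # hypothetical group name
    rules:
      - alert: gm_nginx_pid
        # Fires when no series carries the Nginx pidfile label at all,
        # or when such a series exists and reports zero PIDs.
        # The pid_finder matcher is dropped on purpose.
        expr: |
          absent(procstat_lookup_pid_count{job="gm_telegraf_exporter", pidfile="/var/run/nginx.pid"})
          or
          procstat_lookup_pid_count{job="gm_telegraf_exporter", pidfile="/var/run/nginx.pid"} <= 0
        for: 2m                    # assumed delay, tune to the scrape interval
        labels:
          group: "gm_telegraf_exporter"
        annotations:
          identifier: "Host: {{$labels.host}}"
          description: "Trigger: Nginx service is down!"
```

Note that absent() only returns the labels named in its equality matchers (here job and pidfile), so {{$labels.host}} renders empty when that branch fires. The same pattern should apply to the MySQL rule with its own job and pidfile values.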
Upvotes: 0