Reputation: 40336
UPDATES
- Using a different exporter (fly-exporter), alerts fire. If I run Prometheus|Alertmanager and the gcp-exporter on a different host with the ~same config, alerts fire correctly. My hypothesis is that gcp-exporter scrapes (!?) are failing (periodically) and the absence of data (I'm unable to confirm this) is what resets the alerts (see the query sketch after this list).
- Discovered the ALERTS synthetic time-series. Prometheus' graph of it shows gaps, yet querying the API with min_over_time(ALERTS{alertname="gcp_cloud_run_services_running"}[12h]) returns 1.
- "Active Since" (activeAt) is resetting periodically for no reason that's obvious to me. The time-series behind the alerts don't appear to include any 0 values.
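To test the failed-scrape hypothesis, Prometheus' own up metric for the job can be checked. A sketch only, with ${HOST} and ${PORT} being wherever Prometheus listens (the same placeholders as in the queries further down):

# A 0 anywhere in the window means at least one scrape of gcp-exporter failed
QUERY='min_over_time(up{job="gcp-exporter"}[12h])'
curl \
--silent \
--data-urlencode "query=${QUERY}" \
"http://${HOST}:${PORT}/api/v1/query" \
| jq .

# Compare the actual sample count with the expected 12h / 15m = 48
QUERY='count_over_time(up{job="gcp-exporter"}[12h])'
curl \
--silent \
--data-urlencode "query=${QUERY}" \
"http://${HOST}:${PORT}/api/v1/query" \
| jq .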
I wrote an exporter for Google Cloud services that I use to alert when I'm consuming (more) resources than expected.
It's been working for several months and alerting me (email|Pushover).
I think (!?) the only changes I've made recently have been to bump:
docker.io/prom/prometheus:v2.37.0
docker.io/prom/alertmanager:v0.24.0
However, I'm no longer (~weeks?) being alerted when I expect to.
This is because the alerts move to "pending" but never "firing".
I don't understand why this is.
The issue is with the Prometheus alert not "firing" and not with AlertManager.
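To narrow it to the Prometheus side, the rules endpoint reports each alerting rule's state and health independently of Alertmanager. A sketch, using the same ${HOST} and ${PORT} placeholders as the queries below:

curl \
--silent \
"http://${HOST}:${PORT}/api/v1/rules?type=alert" \
| jq '.data.groups[].rules[] | {name, state, health, lastEvaluation}'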
I've definitely been running resources on 2 different monitored services (Cloud Functions; Cloud Run) for longer than 1 day:
date --rfc-3339=seconds --utc
2022-07-22 22:03:37+00:00
gcloud functions list \
--project=${PROJECT} \
--format="value(updateTime)"
2020-12-03T20:23:30.940Z # 03-December-2020
gcloud run services list --project=${PROJECT} \
--format="value(status.conditions['lastTransitionTime'].slice(0))"
2022-07-21T16:21:24.293561Z # 21-July-2022
2022-07-21T16:20:48.439194Z # 21-July-2022
And /api/v1/alerts:
{
  "status": "success",
  "data": {
    "alerts": [
      {
        "labels": {
          "alertname": "gcp_cloud_functions_running",
          "severity": "page"
        },
        "annotations": {
          "summary": "GCP Cloud Functions running"
        },
        "state": "pending",
        "activeAt": "2022-07-22T21:56:02.984175081Z",
        "value": "1e+00"
      },
      {
        "labels": {
          "alertname": "gcp_cloud_run_services_running",
          "severity": "page"
        },
        "annotations": {
          "summary": "GCP Cloud Run services running"
        },
        "state": "pending",
        "activeAt": "2022-07-22T21:56:02.984175081Z",
        "value": "2e+00"
      }
    ]
  }
}
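For reference, that JSON is the response from the alerts endpoint; a small sketch (${HOST} and ${PORT} assumed, as elsewhere) to reduce it to the fields I keep re-checking:

curl \
--silent \
"http://${HOST}:${PORT}/api/v1/alerts" \
| jq -r '.data.alerts[] | [.labels.alertname, .state, .activeAt] | @tsv'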
NOTE As I'm typing this question, I've noticed that the activeAt values are 21:56:02, which is within the 6-hour window. But why only since then? The resources have existed for longer than that, and Prometheus has been running since 2022-07-21 16:14, when I restarted it thinking it had become wedged.
UPDATE The alerts' activeAt property appears to be reset every 30 minutes:
2022-07-23T01:26:02.984175081Z
2022-07-23T01:26:02.984175081Z
2022-07-23T00:56:02.984175081Z
2022-07-23T00:56:02.984175081Z
Querying the 2 time-series, I don't find any 0 values that could reset the (6h) timer, and it's curious that both alerts have the same activeAt values (both times)?
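If I understand it correctly, Prometheus also keeps an internal ALERTS_FOR_STATE series whose value is the timestamp at which each alert became active, so counting value changes over a window should show whether (and how often) the timer is being reset. A sketch only, same ${HOST}/${PORT} assumption:

QUERY='changes(ALERTS_FOR_STATE{alertname="gcp_cloud_run_services_running"}[6h])'
curl \
--silent \
--data-urlencode "query=${QUERY}" \
"http://${HOST}:${PORT}/api/v1/query" \
| jq .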
And, if I query the time-series:
QUERY="..." # See below
# Currently 22:03 so this range covers 22 hours' data
START="2022-07-01T00:00:00.000Z"
END="2022-07-22T23:59:59.999Z"
STEP="5m"
# Results shown per QUERY below
curl \
--silent \
--data-urlencode "query=${QUERY}" \
--data "start=${START}" \
--data "end=${END}" \
--data "step=${STEP}" \
"http://${HOST}:${PORT}/api/v1/query_range" \
| jq -r '.data.result[].values[][1]' \
| sort \
| uniq -c
# Cloud Functions
QUERY="min_over_time(gcp_cloud_functions_functions[15m]>0"
348 1
60 2
# Cloud Run
QUERY="min_over_time(gcp_cloud_run_services[15m])>0"
60 10
609 2
My interpretation of the above is that, for both alerts, the data (admittedly stepped in 5-minute increments) never includes zeros and is thus always >0 for both queries. Yet, the alerts don't fire.
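A related check (same variables as the query above) is to count samples rather than look for zero values: with a 15-minute scrape_interval, a 15-minute window can legitimately contain no samples at all, in which case the alert expression returns nothing (rather than 0) and the for: timer resets. Steps whose windows are empty simply drop out of the result, so fewer returned values than steps in the range also indicates gaps:

QUERY="count_over_time(gcp_cloud_run_services[15m])"
curl \
--silent \
--data-urlencode "query=${QUERY}" \
--data "start=${START}" \
--data "end=${END}" \
--data "step=${STEP}" \
"http://${HOST}:${PORT}/api/v1/query_range" \
| jq -r '.data.result[].values[][1]' \
| sort \
| uniq -c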
Questions
prometheus.yml:
global:
  scrape_interval: 1m
  scrape_timeout: 10s
  evaluation_interval: 1m
alerting:
  alertmanagers:
    - follow_redirects: true
      enable_http2: true
      scheme: http
      timeout: 10s
      api_version: v2
      static_configs:
        - targets:
            - localhost:9093
rule_files:
  - /etc/alertmanager/rules.yml
scrape_configs:
  - job_name: gcp-exporter
    honor_timestamps: true
    scrape_interval: 15m
    scrape_timeout: 30s
    metrics_path: /metrics
    scheme: http
    follow_redirects: true
    enable_http2: true
    static_configs:
      - targets:
          - localhost:9402
rules.yml:
groups:
  - name: gcp_exporter
    rules:
      - alert: gcp_cloud_functions_running
        expr: min_over_time(gcp_cloud_functions_functions{}[15m]) > 0
        for: 6h
        labels:
          severity: page
        annotations:
          summary: GCP Cloud Functions running
      - alert: gcp_cloud_run_services_running
        expr: min_over_time(gcp_cloud_run_services{}[15m]) > 0
        for: 6h
        labels:
          severity: page
        annotations:
          summary: GCP Cloud Run services running
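For completeness, promtool can validate both files. The rules.yml path is the one referenced by the config above; the prometheus.yml location is an assumption:

# "check config" also validates the rule_files it references
promtool check config /etc/prometheus/prometheus.yml
promtool check rules /etc/alertmanager/rules.yml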
{
  "status": "success",
  "data": {
    "startTime": "2022-07-21T16:14:23.941571056Z",
    "CWD": "/prometheus",
    "reloadConfigSuccess": true,
    "lastConfigTime": "2022-07-21T16:14:23Z",
    "corruptionCount": 0,
    "goroutineCount": 43,
    "GOMAXPROCS": 4,
    "GOGC": "",
    "GODEBUG": "",
    "storageRetention": "15d"
  }
}
{
  "status": "success",
  "data": {
    "version": "2.37.0",
    "revision": "b41e0750abf5cc18d8233161560731de05199330",
    "branch": "HEAD",
    "buildUser": "root@0ebb6827e27f",
    "buildDate": "20220714-15:19:21",
    "goVersion": "go1.18.4"
  }
}
{
  "status": "success",
  "data": {
    "activeAlertmanagers": [
      {
        "url": "http://localhost:9093/api/v2/alerts"
      }
    ],
    "droppedAlertmanagers": []
  }
}
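For reference, the three status blocks above correspond, I believe, to these endpoints (${HOST} and ${PORT} as before):

curl --silent "http://${HOST}:${PORT}/api/v1/status/runtimeinfo" | jq .
curl --silent "http://${HOST}:${PORT}/api/v1/status/buildinfo" | jq .
curl --silent "http://${HOST}:${PORT}/api/v1/alertmanagers" | jq .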
Upvotes: 0
Views: 4021
Reputation: 40336
I'm confident that the underlying issue was my mistaken use of overly long scrape_interval values.
I'd had:
scrape_configs:
  - job_name: gcp-exporter
    scrape_interval: 15m
I realized that this results in roughly 15-minute stretches of alert measurements separated by 1-minute gaps: with scrapes only every 15 minutes, a 1-minute rule evaluation occasionally lands on a 15-minute window containing no samples, the expression returns nothing, and the pending alert resets.
Yesterday, I read Brian Brazil's "Keep It Simple scrape_interval" and reverted the scrape_interval to the global value of scrape_interval: 1m.
Now, the ALERTS metrics appear continuous and, most importantly, the alerts are firing as expected.
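A quick way to verify both points (sketch only; ${HOST} and ${PORT} are wherever Prometheus listens):

# With 1-minute scrapes, every 15-minute window should now contain ~15 samples
QUERY='count_over_time(gcp_cloud_run_services[15m])'
curl \
--silent \
--data-urlencode "query=${QUERY}" \
"http://${HOST}:${PORT}/api/v1/query" \
| jq .

# And the synthetic ALERTS series should report the alerts as firing
QUERY='ALERTS{alertstate="firing"}'
curl \
--silent \
--data-urlencode "query=${QUERY}" \
"http://${HOST}:${PORT}/api/v1/query" \
| jq .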
Upvotes: 0