DazWilkin

Reputation: 40336

Alert always "pending" never "firing" after only (!?) upgrading container versions

UPDATES

  1. Using a different exporter (fly-exporter), alerts fire. If I run Prometheus|Alertmanager and the gcp-exporter on a different host with ~the same config, alerts fire correctly. My hypothesis is that gcp-exporter scrapes (!?) are failing (periodically) and that the resulting absence of data (I'm unable to confirm this) is what resets the alerts.
  2. Discovered the ALERTS synthetic time-series. Prometheus' graph of this shows gaps, yet querying the API, min_over_time(ALERTS{alertname="gcp_cloud_run_services_running"}[12h]) returns 1 (see the query sketch after this list).
  3. "Active Since" (activeAt) is resetting periodically for no reason that's obvious to me. The time-series behind the alerts don't appear to include any 0 values.

I wrote an exporter for Google Cloud services that I use to alert when I'm consuming (more) resources than expected.

It's been working for several months and alerting me (email|Pushover).

I think (!?) the only changes I've made recently have been to bump:

  1. container image versions
  2. google.golang.org/api

However, I'm no longer (~weeks?) being alerted when I expect to.

This is because the alerts move to "pending" but never "firing".

I don't understand why this is.

The issue is with the Prometheus alert not "firing" and not with AlertManager.

I've definitely been running resources on 2 different monitored services (Cloud Functions; Cloud Run) for longer than 1 day:

date --rfc-3339=seconds --utc
2022-07-22 22:03:37+00:00

gcloud functions list \
--project=${PROJECT} \
--format="value(updateTime)"
2020-12-03T20:23:30.940Z # 03-December-2020

gcloud run services list --project=${PROJECT} \
--format="value(status.conditions['lastTransitionTime'].slice(0))"
2022-07-21T16:21:24.293561Z # 21-July-2022
2022-07-21T16:20:48.439194Z # 21-July-2022
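
To rule out the exporter itself, a quick sanity check (a sketch; it assumes the exporter is reachable at the localhost:9402 target from the prometheus.yml below and greps for the metric names used in the alert rules):

# Confirm the exporter reports non-zero counts for the metrics behind the alerts
curl --silent http://localhost:9402/metrics \
| grep -E '^gcp_cloud_(functions_functions|run_services)'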

And /api/v1/alerts:

{
  "status": "success",
  "data": {
    "alerts": [
      {
        "labels": {
          "alertname": "gcp_cloud_functions_running",
          "severity": "page"
        },
        "annotations": {
          "summary": "GCP Cloud Functions running"
        },
        "state": "pending",
        "activeAt": "2022-07-22T21:56:02.984175081Z",
        "value": "1e+00"
      },
      {
        "labels": {
          "alertname": "gcp_cloud_run_services_running",
          "severity": "page"
        },
        "annotations": {
          "summary": "GCP Cloud Run services running"
        },
        "state": "pending",
        "activeAt": "2022-07-22T21:56:02.984175081Z",
        "value": "2e+00"
      }
    ]
  }
}

NOTE As I'm typing this question, I've noticed that the activeAt values are 21:56:02, which is within the 6-hour window. But why only since then? The resources have existed for longer than that, and Prometheus has been running since 2022-07-21 16:14, when I restarted it thinking it had become wedged.

UPDATE The alerts' activeAt property appears to be reset every 30 minutes:

2022-07-23T01:26:02.984175081Z
2022-07-23T01:26:02.984175081Z

2022-07-23T00:56:02.984175081Z
2022-07-23T00:56:02.984175081Z

Querying the 2 time-series, I don't find any 0 values that could reset the (6h) timer, and it's curious that both alerts have the same activeAt values (both times).
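
The rules API is another place to look for resets and failed evaluations (a sketch; HOST and PORT are placeholders, and the fields extracted are from the /api/v1/rules response):

# Per alerting rule: health, last error (if any) and last evaluation time
curl \
--silent \
"http://${HOST}:${PORT}/api/v1/rules?type=alert" \
| jq -c '.data.groups[].rules[] | {name, health, lastError, lastEvaluation}'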

And, if I query the time-series:

QUERY="..." # See below

# Currently 22:03 (2022-07-22), so the range below covers all of the retained data
START="2022-07-01T00:00:00.000Z"
END="2022-07-22T23:59:59.999Z"
STEP="5m"

# Results shown per QUERY below
curl \
--silent \
--data-urlencode "query=${QUERY}" \
--data "start=${START}" \
--data "end=${END}" \
--data "step=${STEP}" \
"http://${HOST}:${PORT}/api/v1/query_range" \
| jq -r '.data.result[].values[][1]' \
| sort \
| uniq -c

# Cloud Functions
QUERY="min_over_time(gcp_cloud_functions_functions[15m]>0"
    348 1
     60 2

# Cloud Run
QUERY="min_over_time(gcp_cloud_run_services[15m])>0"
     60 10
    609 2

My interpretation of the above is that, for both alerts, the data (admittedly stepped in 5-minute increments) never includes zeros and is thus always >0 for both queries. Yet the alerts don't fire.
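
One caveat with the query above: a query_range stepped in 5-minute increments can't show whether a given [15m] window was empty, because steps with no result simply return no value. Counting the raw samples per window instead (a sketch, reusing the HOST/PORT/START/END/STEP variables above over a range known to contain data) would reveal this: fewer returned values than steps in the range means evaluations where the alert expression produced nothing.

# Values returned per series; compare against the number of steps in the range
QUERY='count_over_time(gcp_cloud_run_services[15m])'
curl \
--silent \
--data-urlencode "query=${QUERY}" \
--data "start=${START}" \
--data "end=${END}" \
--data "step=${STEP}" \
"http://${HOST}:${PORT}/api/v1/query_range" \
| jq -r '.data.result[].values | length'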

Questions

  1. Why is this?
  2. What is my misunderstanding?
  3. Is there a better way for me to debug this?

prometheus.yml:

global:
  scrape_interval: 1m
  scrape_timeout: 10s
  evaluation_interval: 1m
alerting:
  alertmanagers:
  - follow_redirects: true
    enable_http2: true
    scheme: http
    timeout: 10s
    api_version: v2
    static_configs:
    - targets:
      - localhost:9093
rule_files:
- /etc/alertmanager/rules.yml
scrape_configs:
- job_name: gcp-exporter
  honor_timestamps: true
  scrape_interval: 15m
  scrape_timeout: 30s
  metrics_path: /metrics
  scheme: http
  follow_redirects: true
  enable_http2: true
  static_configs:
  - targets:
    - localhost:9402

rules.yml:

groups:
- name: gcp_exporter
  rules:
  - alert: gcp_cloud_functions_running
    expr: min_over_time(gcp_cloud_functions_functions{}[15m]) > 0
    for: 6h
    labels:
      severity: page
    annotations:
      summary: GCP Cloud Functions running
  - alert: gcp_cloud_run_services_running
    expr: min_over_time(gcp_cloud_run_services{}[15m]) > 0
    for: 6h
    labels:
      severity: page
    annotations:
      summary: GCP Cloud Run services running
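
For completeness, the rule file and config can be validated with promtool (which ships with Prometheus); the prometheus.yml path here is an assumption:

# Validate rule syntax and expressions
promtool check rules /etc/alertmanager/rules.yml

# Validate the main config (also loads the rule_files it references); path assumed
promtool check config /etc/prometheus/prometheus.yml
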
Runtime Information
{
  "status": "success",
  "data": {
    "startTime": "2022-07-21T16:14:23.941571056Z",
    "CWD": "/prometheus",
    "reloadConfigSuccess": true,
    "lastConfigTime": "2022-07-21T16:14:23Z",
    "corruptionCount": 0,
    "goroutineCount": 43,
    "GOMAXPROCS": 4,
    "GOGC": "",
    "GODEBUG": "",
    "storageRetention": "15d"
  }
}
Build Information
{
  "status": "success",
  "data": {
    "version": "2.37.0",
    "revision": "b41e0750abf5cc18d8233161560731de05199330",
    "branch": "HEAD",
    "buildUser": "root@0ebb6827e27f",
    "buildDate": "20220714-15:19:21",
    "goVersion": "go1.18.4"
  }
}
Alertmanagers
{
  "status": "success",
  "data": {
    "activeAlertmanagers": [
      {
        "url": "http://localhost:9093/api/v2/alerts"
      }
    ],
    "droppedAlertmanagers": []
  }
}

Upvotes: 0

Views: 4021

Answers (1)

DazWilkin

Reputation: 40336

I'm confident that the underlying issue was my mistaken use of an overly long scrape_interval.

I'd had:

scrape_configs:
- job_name: gcp-exporter
  scrape_interval: 15m

I realized that this results in roughly 15-minute runs of alert (ALERTS) measurements separated by 1-minute gaps, and each gap resets the alert's for: timer:

[Graph: ALERTS time-series showing periodic gaps]

Yesterday, I read Brian Brazil's "Keep It Simple scrape_interval" post and reverted the scrape_interval to the global value of scrape_interval: 1m.
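
For reference, a sketch of the corrected job configuration (the per-job scrape_interval is simply removed so the job inherits the global 1m; the target is unchanged from the question):

scrape_configs:
- job_name: gcp-exporter
  # No per-job scrape_interval: inherit the global 1m so that every
  # [15m] range and every 1m rule evaluation sees fresh samples
  static_configs:
  - targets:
    - localhost:9402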

Now, the ALERTS series appears continuous and, most importantly, alerts are firing as expected:

[Graph: ALERTS time-series now continuous]

Upvotes: 0
