Dominik

Reputation: 2551

Google Cloud Monitoring: Why doesn't this alert policy fire? (policy for "logs ingestion higher than last week")

I want to be alerted when my services create significantly more logs than last week. A simple threshold check would be too insensitive, since the log rate resembles a sine wave, rising and falling over the course of the day. If you know a better way to achieve this, I'd be happy to use that instead.

That's my approach (in GCP's "Monitoring Query Language", MQL), which doesn't create any incidents:

fetch global::logging.googleapis.com/billing/bytes_ingested
| align delta_gauge(1m)
| { t_0: ident
  ; t_1: time_shift 1w }
| join
| value
    [t_0_value_bytes_ingested_mean_sub:
       sub(t_0.value.bytes_ingested, t_1.value.bytes_ingested)]
| condition ge(t_0_value_bytes_ingested_mean_sub, 10'MiBy')

[Diagram of the metric output from the query]

Here's the full AlertPolicy (retrieved via the REST API):

{
  "name": "projects/[PROJECT_ID_OR_NUMBER]/alertPolicies/[ALERT_POLICY_ID]",
  "displayName": "Logs: \u003e10MiB/min. compared to last week",
  "combiner": "OR",
  "creationRecord": {
    "mutateTime": "2022-06-16T14:15:51.572165064Z",
    "mutatedBy": "[REDACTED]"
  },
  "mutationRecord": {
    "mutateTime": "2022-06-24T10:14:45.366847354Z",
    "mutatedBy": "[REDACTED]"
  },
  "conditions": [
    {
      "displayName": "Logs: \u003e10MiB/min. compared to last week",
      "name": "projects/[PROJECT_ID_OR_NUMBER]/alertPolicies/[ALERT_POLICY_ID]/conditions/6368804464715103184",
      "conditionMonitoringQueryLanguage": {
        "query": "fetch global::logging.googleapis.com/billing/bytes_ingested\n| align delta_gauge(1m)\n| { t_0: ident\n  ; t_1: time_shift 1w }\n| join\n| value\n    [t_0_value_bytes_ingested_mean_sub:\n       sub(t_0.value.bytes_ingested, t_1.value.bytes_ingested)]\n| condition ge(t_0_value_bytes_ingested_mean_sub, 10'MiBy')",
        "duration": "0s",
        "trigger": {
          "count": 1
        }
      }
    }
  ],
  "documentation": {
    "content": "Do sth.",
    "mimeType": "text/markdown"
  },
  "notificationChannels": [
    "projects/[PROJECT_ID_OR_NUMBER]/notificationChannels/[REDACTED]"
  ],
  "enabled": true,
  "alertStrategy": {
    "autoClose": "604800s"
  }
}

Upvotes: 0

Views: 501

Answers (1)

André Laszlo

Reputation: 15537

The limits for alerts seem to have changed recently. I think one of my alert policies was disabled because it compared values with the previous week, just like yours.

It's unfortunate, since comparing to the previous week is often a really simple way to detect anomalies. Now we're limited to the previous day, and we may get a lot of false positives on Mondays, since traffic volume is lower on weekends.

I understand that huge windows are expensive to evaluate, but with the time_shift operator that shouldn't really apply, since we effectively compare only two small windows.

For more information on alerting limits, see https://cloud.google.com/monitoring/quotas#alerting_uptime_limits
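If the time_shift duration in alerting queries is indeed now capped at one day, the query from the question could be adapted by shifting only 24 hours instead of a week. This is an untested sketch, only the shift duration and the field name are changed from the original; it trades the weekly seasonality match for compliance with the tighter limit:

fetch global::logging.googleapis.com/billing/bytes_ingested
| align delta_gauge(1m)
| { t_0: ident
  ; t_1: time_shift 1d }
| join
| value
    [t_0_value_bytes_ingested_mean_sub:
       sub(t_0.value.bytes_ingested, t_1.value.bytes_ingested)]
| condition ge(t_0_value_bytes_ingested_mean_sub, 10'MiBy')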

Upvotes: 0
