Reputation: 2551
I want to get alerted when my services create significantly more logs than last week. A simple threshold check would be too insensitive, as the log rate resembles a sine wave (going up and down over the day). If you know a better way to achieve this, I would be happy to use that, too.
This is my approach (in GCP's "Monitoring Query Language", MQL), which doesn't create any incidents:
fetch global::logging.googleapis.com/billing/bytes_ingested
| align delta_gauge(1m)
| { t_0: ident
  ; t_1: time_shift 1w }
| join
| value
    [t_0_value_bytes_ingested_mean_sub:
       sub(t_0.value.bytes_ingested, t_1.value.bytes_ingested)]
| condition ge(t_0_value_bytes_ingested_mean_sub, 10'MiBy')
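A relative comparison might suit the sine-wave pattern better than the fixed 10 MiB offset. Here is a rough, untested sketch of that idea (the factor of 2 is arbitrary, and it presumably misbehaves when last week's value was zero):
fetch global::logging.googleapis.com/billing/bytes_ingested
| align delta_gauge(1m)
| { t_0: ident
  ; t_1: time_shift 1w }
| join
| value
    [bytes_ingested_ratio:
       div(t_0.value.bytes_ingested, t_1.value.bytes_ingested)]
| condition gt(bytes_ingested_ratio, 2)
The only change is replacing sub/ge with div/gt, so the condition would fire when the current rate is more than twice last week's rather than more than 10 MiB/min. above it.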
Here's the full AlertPolicy (retrieved via the REST API):
{
  "name": "projects/[PROJECT_ID_OR_NUMBER]/alertPolicies/[ALERT_POLICY_ID]",
  "displayName": "Logs: \u003e10MiB/min. compared to last week",
  "combiner": "OR",
  "creationRecord": {
    "mutateTime": "2022-06-16T14:15:51.572165064Z",
    "mutatedBy": "[REDACTED]"
  },
  "mutationRecord": {
    "mutateTime": "2022-06-24T10:14:45.366847354Z",
    "mutatedBy": "[REDACTED]"
  },
  "conditions": [
    {
      "displayName": "Logs: \u003e10MiB/min. compared to last week",
      "name": "projects/[PROJECT_ID_OR_NUMBER]/alertPolicies/[ALERT_POLICY_ID]/conditions/6368804464715103184",
      "conditionMonitoringQueryLanguage": {
        "query": "fetch global::logging.googleapis.com/billing/bytes_ingested\n| align delta_gauge(1m)\n| { t_0: ident\n ; t_1: time_shift 1w }\n| join\n| value\n [t_0_value_bytes_ingested_mean_sub:\n sub(t_0.value.bytes_ingested, t_1.value.bytes_ingested)]\n| condition ge(t_0_value_bytes_ingested_mean_sub, 10'MiBy')",
        "duration": "0s",
        "trigger": {
          "count": 1
        }
      }
    }
  ],
  "documentation": {
    "content": "Do sth.",
    "mimeType": "text/markdown"
  },
  "notificationChannels": [
    "projects/[PROJECT_ID_OR_NUMBER]/notificationChannels/[REDACTED]"
  ],
  "enabled": true,
  "alertStrategy": {
    "autoClose": "604800s"
  }
}
Upvotes: 0
Views: 501
Reputation: 15537
The limits for alerts seem to have changed recently. I think one of my alert policies was disabled because it was comparing values with the previous week, just like yours.
It's unfortunate, since comparing to the previous week is often a really simple way to detect anomalies. Now we're limited to the previous day, and we might get a lot of false positives on Mondays since traffic volume is lower on weekends.
I get that huge windows are expensive to evaluate, but with the time_shift operator I don't think that's really true, since we effectively compare two small windows?
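If the one-day limit stands, the same query can presumably still be used with a shorter shift, something like this (untested; it is just the query from the question with the window reduced from 1w to 1d, which of course reintroduces the Monday/weekend problem mentioned above):
fetch global::logging.googleapis.com/billing/bytes_ingested
| align delta_gauge(1m)
| { t_0: ident
  ; t_1: time_shift 1d }
| join
| value
    [t_0_value_bytes_ingested_mean_sub:
       sub(t_0.value.bytes_ingested, t_1.value.bytes_ingested)]
| condition ge(t_0_value_bytes_ingested_mean_sub, 10'MiBy')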
For more information on alerting limits, see https://cloud.google.com/monitoring/quotas#alerting_uptime_limits
Upvotes: 0