Richard

Reputation: 15

Terraform/Datadog Alert Monitoring

I am trying to create a Datadog alert using Terraform for when multiple hosts (more than one) are at >= 95% CPU usage. So far, with the code I have, the alert triggers any time a single host exceeds the threshold, and that is a little too noisy. Would you happen to know how to create the logic so both conditions (>= 95% usage, and on more than one host) have to be met before the alert is triggered? (Alert when multiple hosts are at 95% CPU or higher.)

resource "datadog_monitor" "worker_high_disk_usage" {
    type    = "metric alert"
    name    = "worker high disk usage"
    message = <<-EOT
    {{#is_alert}} 
    @slack_channel {{system}} {{env}} host {{host.name}} device {{device}} has had disk usage 
    enter code hereover {{threshold}} of availible disk space for the last 30m
    {{/is_alert}} 
    {{#is_recovery}}
    @pagerduty
    {{system}} {{env}} host {{host.name}} device {{device}} high disk usage resolved.
    {{/is_recovery}}
    EOT
    query   = "min(last_30m):avg:system.disk.in_use{env:prod,system:worker,team:team} by 
    {host,device} > 0.95"

    thresholds = {
    critical = 0.95

    timeout_h           = 1
  
    require_full_window = false
      lifecycle {
        ignore_changes = [silenced]
      }
      tags = ["disk"]
    }
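For reference, one way to express "only alert when more than one host is over the line" in a single query is to count how many hosts are above the cutoff. The following is only a sketch: cutoff_min() and count_not_null() are Datadog query functions, but whether this exact combination behaves as intended here would need verifying against the docs, and the resource name and the >= 2 threshold are just illustrative.

resource "datadog_monitor" "worker_high_disk_usage_multi_host" {
    type    = "metric alert"
    name    = "worker high disk usage on multiple hosts"
    message = "More than one prod worker host has been over 95% disk usage for 30m."

    # cutoff_min() drops per-host series points below 0.95, and count_not_null()
    # counts the host/device series that remain, so the monitor only goes
    # critical once 2 or more of them are over the threshold at the same time.
    query = "min(last_30m):count_not_null(cutoff_min(avg:system.disk.in_use{env:prod,system:worker,team:team} by {host,device}, 0.95)) >= 2"

    thresholds = {
        critical = 2
    }
}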

Upvotes: 0

Views: 1304

Answers (1)

eilon47

Reputation: 51

Not sure if this will work, but you can give it a try:

  1. create 2 instances of the same monitor mentioned above
  2. create a composite monitor based on them both (rough Terraform sketch below)
  3. trigger the composite only when a.value is not the same as b.value

{{^is_exact_match a.value b.value }}

@[email protected] Alert: 2 hosts have passed the threshold

{{/is_exact_match}}

If the values are the same: ignore, do nothing.


The problem is that you might still get 2 alerts at the same time...
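For what it's worth, steps 1 and 2 might be wired up in Terraform roughly like the sketch below. The resource names are made up, the "2 instances" are created with count, and whether a composite of two identical monitors really only fires for two different hosts (rather than the same host tripping both copies) would need testing.

# Step 1: two instances of the same per-host disk monitor.
resource "datadog_monitor" "disk" {
    count   = 2
    type    = "metric alert"
    name    = "worker high disk usage (${count.index})"
    message = "Host {{host.name}} device {{device}} is over {{threshold}} disk usage."
    query   = "min(last_30m):avg:system.disk.in_use{env:prod,system:worker,team:team} by {host,device} > 0.95"
}

# Step 2: a composite monitor that alerts only when both sub-monitors are alerting.
resource "datadog_monitor" "disk_composite" {
    type    = "composite"
    name    = "worker high disk usage - multiple hosts"
    message = "At least two worker hosts are over the disk usage threshold."
    query   = "${datadog_monitor.disk[0].id} && ${datadog_monitor.disk[1].id}"
}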

Upvotes: 0
