Richard

Reputation: 15

Terraform/Datadog Alert Monitoring

I am trying to create a Datadog alert using Terraform for when multiple hosts (more than one) are at >= 95% CPU usage. So far, with the code I have, the alert triggers any time a single host exceeds the threshold, and that is a little too noisy. Would you happen to know how to create the logic so both conditions (>= 95% usage, and on more than one host) have to be met before the alert is triggered? (Alert when multiple hosts are at 95% CPU or higher.)

resource "datadog_monitor" "worker_high_disk_usage" {
    type    = "metric alert"
    name    = "worker high disk usage"
    message = <<-EOT
    {{#is_alert}} 
    @slack_channel {{system}} {{env}} host {{host.name}} device {{device}} has had disk usage 
    enter code hereover {{threshold}} of availible disk space for the last 30m
    {{/is_alert}} 
    {{#is_recovery}}
    @pagerduty
    {{system}} {{env}} host {{host.name}} device {{device}} high disk usage resolved.
    {{/is_recovery}}
    EOT
    query   = "min(last_30m):avg:system.disk.in_use{env:prod,system:worker,team:team} by 
    {host,device} > 0.95"

    thresholds = {
    critical = 0.95

    timeout_h           = 1
  
    require_full_window = false
      lifecycle {
        ignore_changes = [silenced]
      }
      tags = ["disk"]
    }
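For reference, one way to express "only alert when more than one host is over the line" in a single query is to count how many hosts are above the cutoff. The following is only a sketch: cutoff_min() and count_not_null() are Datadog query functions, but whether this exact combination behaves as intended here would need verifying against the docs, and the resource name and the >= 2 threshold are just illustrative.

resource "datadog_monitor" "worker_high_disk_usage_multi_host" {
    type    = "metric alert"
    name    = "worker high disk usage on multiple hosts"
    message = "More than one prod worker host has been over 95% disk usage for 30m."

    # cutoff_min() drops per-host series points below 0.95, and count_not_null()
    # counts the host/device series that remain, so the monitor only goes
    # critical once 2 or more of them are over the threshold at the same time.
    query = "min(last_30m):count_not_null(cutoff_min(avg:system.disk.in_use{env:prod,system:worker,team:team} by {host,device}, 0.95)) >= 2"

    thresholds = {
        critical = 2
    }
}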

Upvotes: 0

Views: 1304

Answers (1)

eilon47

Reputation: 51

Not sure if this will work, but you can give it a try:

  1. create 2 instances of the same monitor mentioned above
  2. create a composite monitor based on them both (rough Terraform sketch below)
  3. trigger the composite only when a.value is not the same as b.value

{{^is_exact_match a.value b.value }}

@[email protected] Alert: 2 hosts have passed the threshold

{{/is_exact_match}}

If the values are the same: ignore, do nothing.


The problem is that you might still get 2 alerts at the same time...
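For what it's worth, steps 1 and 2 might be wired up in Terraform roughly like the sketch below. The resource names are made up, the "2 instances" are created with count, and whether a composite of two identical monitors really only fires for two different hosts (rather than the same host tripping both copies) would need testing.

# Step 1: two instances of the same per-host disk monitor.
resource "datadog_monitor" "disk" {
    count   = 2
    type    = "metric alert"
    name    = "worker high disk usage (${count.index})"
    message = "Host {{host.name}} device {{device}} is over {{threshold}} disk usage."
    query   = "min(last_30m):avg:system.disk.in_use{env:prod,system:worker,team:team} by {host,device} > 0.95"
}

# Step 2: a composite monitor that alerts only when both sub-monitors are alerting.
resource "datadog_monitor" "disk_composite" {
    type    = "composite"
    name    = "worker high disk usage - multiple hosts"
    message = "At least two worker hosts are over the disk usage threshold."
    query   = "${datadog_monitor.disk[0].id} && ${datadog_monitor.disk[1].id}"
}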

Upvotes: 0
