Golide

Multi-tenant Loki Ruler not sending alerts to Multi-tenant Mimir AlertManager

I have an AlertRule set up in my Loki-Distributed instance via the Loki ruler as below:

ruler:
  directories:
    fake:
      rules.txt: |
        groups:
          - name: mimir_loki_test
            rules:
              - alert: LokiAlertsMimir 
                expr: |
                  sum(rate({node_name="ip-10-XXX-XXX-XXX.ec2.internal"} |~ ".*OOM-killed.*" [5m])) by (node_name)
                for: 3m
                labels:
                  severity: critical
                  job: mimir
                annotations:
                  summary: Loki Alerts with MimirAM - OOM Alert

Before I enabled multi-tenancy on the Loki Helm chart, the AlertRule was firing as expected and was visible in the Mimir AlertManager UI. The UI is accessed via an Nginx reverse proxy that sits in front of the Loki instance.

After enabling multi-tenancy on the Loki-Distributed chart, the AlertRule is no longer visible in Mimir AlertManager, nor is it firing. With multi-tenancy enabled, the relevant part of the Loki-Distributed configuration looks like this:

gateway:
  # -- Specifies whether the gateway should be enabled
  enabled: true
  ingress:
    annotations:
      cert-manager.io/cluster-issuer: letsencrypt-prod
      external-dns.alpha.kubernetes.io/hostname: loki.${sre_domain}
      nginx.ingress.kubernetes.io/auth-secret: loki-basic-auth
      nginx.ingress.kubernetes.io/auth-type: basic
    enabled: true
    hosts:
      - host: loki.${sre_domain}
        paths:
          - path: /
            pathType: Prefix
    ingressClassName: nginx
    tls:
      - hosts:
          - loki.${sre_domain}
        secretName: loki.${sre_domain}-tls
  nginxConfig:
    httpSnippet: |-
      client_max_body_size 100M;
      proxy_connect_timeout       900;
      proxy_send_timeout          900;
      proxy_read_timeout          900;
      send_timeout                900;
    serverSnippet: |-
      client_max_body_size 100M;
      location ~ /loki/api/v1/rules/ { proxy_pass  http://loki-ruler.monitoring.svc.cluster.local:3100$request_uri; }
loki:
  config: |
    auth_enabled: true
    ruler:
      alertmanager_url: http://mimir-alertmanager.monitoring:8080/mimir-alertmanager
      external_url: https://monitoring.${sre_domain}/mimir-alertmanager
      enable_alertmanager_v2: true
      enable_api: true
      ring:
        kvstore:
          store: memberlist
      rule_path: /tmp/loki/scratch
      storage:
        local:
          directory: /etc/loki/rules
        type: local

If I check the Loki ruler pod logs, I can see that the AlertRule is being evaluated:

level=info ts=2024-07-18T08:06:46.114021053Z caller=compat.go:66 user=fake rule_name="LokiAlertsMimir" rule_type=alerting query="sum by (node_name)(rate({node_name=\"ip-10-XXX-XXX-XXX.ec2.internal\"} |~ \".*OOM-killed.*\"[5m]))" query_hash=2362249182 msg="evaluating rule"
level=info ts=2024-07-18T08:06:46.114066624Z caller=engine.go:232 component=ruler evaluation_mode=local org_id=fake msg="executing query" type=instant query="sum by (node_name)(rate({node_name=\"ip-10-XXX-XXX-XXX.ec2.internal\"} |~ \".*OOM-killed.*\"[5m]))" query_hash=2362249182
level=info ts=2024-07-18T08:06:46.115572328Z caller=metrics.go:160 component=ruler evaluation_mode=local org_id=fake latency=fast query="sum by (node_name)(rate({node_name=\"ip-10-XXX-XXX-XXX.ec2.internal\"} |~ \".*OOM-killed.*\"[5m]))" query_hash=2362249182 query_type=metric range_type=instant length=0s start_delta=2.679478ms end_delta=2.679618ms step=0s duration=1.433063ms status=200 limit=0 returned_lines=0 throughput=0B total_bytes=0B total_bytes_structured_metadata=0B lines_per_second=0 total_lines=0 post_filter_lines=0 total_entries=0 store_chunks_download_time=0s queue_time=0s splits=0 shards=0 chunk_refs_fetch_time=1.033714ms cache_chunk_req=0 cache_chunk_hit=0 cache_chunk_bytes_stored=0 cache_chunk_bytes_fetched=0 cache_chunk_download_time=0s cache_index_req=0 cache_index_hit=0 cache_index_download_time=0s cache_stats_results_req=0 cache_stats_results_hit=0 cache_stats_results_download_time=0s cache_result_req=0 cache_result_hit=0 cache_result_download_time=0s

If I check the Mimir AlertManager pod logs, I see nothing related to the AlertRule, not even errors:

ts=2024-07-18T08:04:22.457955222Z caller=multitenant.go:546 level=info component=MultiTenantAlertmanager msg="synchronizing alertmanager configs for users"
ts=2024-07-18T08:04:37.457257717Z caller=multitenant.go:546 level=info component=MultiTenantAlertmanager msg="synchronizing alertmanager configs for users"

It appears the AlertManager is completely ignoring the alert. The receiver configuration in the AlertManager is correct; it was validated before I turned on multi-tenancy.

There is a comment here that says that in multi-tenancy mode “Loki forwards the X-Scope-OrgID header”. I am not sure exactly what this means: do I need to set the X-Scope-OrgID header for requests to the Mimir AlertManager myself, or is it already set on each POST call to the Mimir AlertManager URL?

Additional information:

  Loki-Distributed Chart: 0.79.1
  Mimir-Distributed Chart: 5.4.0
  Grafana Chart: 8.3.4

I have several questions:

  1. Do I need to set the X-Scope-OrgID header myself for the “remote push” of alerts from the Loki ruler?

  2. Is there a way I can check whether my Mimir AlertManager is tenant-aware?

  3. How do I check whether the Loki ruler is sending the tenant ID to the Mimir AlertManager, if at all?
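
To narrow down the tenant question, this is the kind of manual check I have in mind. It is only a minimal sketch: it assumes the AlertManager accepts alerts on the standard Alertmanager v2 path (/api/v2/alerts) appended to my alertmanager_url, and that the tenant is "fake" as shown in the ruler logs. The alert name ManualTenantTest and both function names are made up for illustration.

```python
import json
import urllib.request
from datetime import datetime, timezone

# Assumptions: base URL copied from alertmanager_url in my config,
# tenant ID taken from the "user=fake" field in the ruler logs.
ALERTMANAGER_URL = "http://mimir-alertmanager.monitoring:8080/mimir-alertmanager"
TENANT_ID = "fake"


def build_test_alert_request(base_url, tenant_id):
    """Build (but do not send) a POST carrying one hand-made test alert."""
    alerts = [
        {
            # Hypothetical alert, unrelated to the real rule.
            "labels": {"alertname": "ManualTenantTest", "severity": "critical"},
            "annotations": {"summary": "Manual test alert for tenant check"},
            "startsAt": datetime.now(timezone.utc).isoformat(),
        }
    ]
    return urllib.request.Request(
        # Assumed path: the standard Alertmanager v2 alerts endpoint.
        url=base_url.rstrip("/") + "/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # The tenant header whose presence I am unsure about.
            "X-Scope-OrgID": tenant_id,
        },
        method="POST",
    )


def send(request, timeout=10):
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Running send(build_test_alert_request(ALERTMANAGER_URL, TENANT_ID)) from a pod inside the cluster, once with the header and once with the header removed, should show whether the AlertManager only accepts alerts that carry a tenant ID.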

What am I missing?

Upvotes: 0

Views: 372

Answers (0)
