Reputation: 8869
Looking at the list of Prometheus-related metrics, I see:
prometheus_build_info
prometheus_config_last_reload_success_timestamp_seconds
prometheus_config_last_reload_successful
prometheus_engine_queries
prometheus_engine_queries_concurrent_max
prometheus_engine_query_duration_seconds
prometheus_engine_query_duration_seconds_count
prometheus_engine_query_duration_seconds_sum
prometheus_evaluator_duration_seconds
prometheus_evaluator_duration_seconds_count
prometheus_evaluator_duration_seconds_sum
prometheus_evaluator_iterations_missed_total
prometheus_evaluator_iterations_skipped_total
prometheus_evaluator_iterations_total
prometheus_local_storage_checkpoint_duration_seconds_count
prometheus_local_storage_checkpoint_duration_seconds_sum
prometheus_local_storage_checkpoint_last_duration_seconds
prometheus_local_storage_checkpoint_last_size_bytes
prometheus_local_storage_checkpoint_series_chunks_written_count
prometheus_local_storage_checkpoint_series_chunks_written_sum
prometheus_local_storage_checkpointing
prometheus_local_storage_chunk_ops_total
prometheus_local_storage_chunks_to_persist
prometheus_local_storage_fingerprint_mappings_total
prometheus_local_storage_inconsistencies_total
prometheus_local_storage_indexing_batch_duration_seconds
prometheus_local_storage_indexing_batch_duration_seconds_count
prometheus_local_storage_indexing_batch_duration_seconds_sum
prometheus_local_storage_indexing_batch_sizes
prometheus_local_storage_indexing_batch_sizes_count
prometheus_local_storage_indexing_batch_sizes_sum
prometheus_local_storage_indexing_queue_capacity
prometheus_local_storage_indexing_queue_length
prometheus_local_storage_ingested_samples_total
prometheus_local_storage_maintain_series_duration_seconds
prometheus_local_storage_maintain_series_duration_seconds_count
prometheus_local_storage_maintain_series_duration_seconds_sum
prometheus_local_storage_memory_chunkdescs
prometheus_local_storage_memory_chunks
prometheus_local_storage_memory_dirty_series
prometheus_local_storage_memory_series
prometheus_local_storage_non_existent_series_matches_total
prometheus_local_storage_open_head_chunks
prometheus_local_storage_out_of_order_samples_total
prometheus_local_storage_persist_errors_total
prometheus_local_storage_persistence_urgency_score
prometheus_local_storage_queued_chunks_to_persist_total
prometheus_local_storage_rushed_mode
prometheus_local_storage_series_chunks_persisted_bucket
prometheus_local_storage_series_chunks_persisted_count
prometheus_local_storage_series_chunks_persisted_sum
prometheus_local_storage_series_ops_total
prometheus_local_storage_started_dirty
prometheus_local_storage_target_heap_size_bytes
prometheus_notifications_dropped_total
prometheus_notifications_errors_total
prometheus_notifications_latency_seconds
prometheus_notifications_latency_seconds_count
prometheus_notifications_latency_seconds_sum
prometheus_notifications_queue_capacity
prometheus_notifications_queue_length
prometheus_notifications_sent_total
prometheus_rule_evaluation_duration_seconds
prometheus_rule_evaluation_duration_seconds_count
prometheus_rule_evaluation_duration_seconds_sum
prometheus_rule_evaluation_failures_total
prometheus_sd_azure_refresh_duration_seconds
prometheus_sd_azure_refresh_duration_seconds_count
prometheus_sd_azure_refresh_duration_seconds_sum
prometheus_sd_azure_refresh_failures_total
prometheus_sd_consul_rpc_duration_seconds
prometheus_sd_consul_rpc_duration_seconds_count
prometheus_sd_consul_rpc_duration_seconds_sum
prometheus_sd_consul_rpc_failures_total
prometheus_sd_dns_lookup_failures_total
prometheus_sd_dns_lookups_total
prometheus_sd_ec2_refresh_duration_seconds
prometheus_sd_ec2_refresh_duration_seconds_count
prometheus_sd_ec2_refresh_duration_seconds_sum
prometheus_sd_ec2_refresh_failures_total
prometheus_sd_file_read_errors_total
prometheus_sd_file_scan_duration_seconds
prometheus_sd_file_scan_duration_seconds_count
prometheus_sd_file_scan_duration_seconds_sum
prometheus_sd_gce_refresh_duration
prometheus_sd_gce_refresh_duration_count
prometheus_sd_gce_refresh_duration_sum
prometheus_sd_gce_refresh_failures_total
prometheus_sd_kubernetes_events_total
prometheus_sd_marathon_refresh_duration_seconds
prometheus_sd_marathon_refresh_duration_seconds_count
prometheus_sd_marathon_refresh_duration_seconds_sum
prometheus_sd_marathon_refresh_failures_total
prometheus_sd_triton_refresh_duration_seconds
prometheus_sd_triton_refresh_duration_seconds_count
prometheus_sd_triton_refresh_duration_seconds_sum
prometheus_sd_triton_refresh_failures_total
prometheus_target_interval_length_seconds
prometheus_target_interval_length_seconds_count
prometheus_target_interval_length_seconds_sum
prometheus_target_scrape_pool_sync_total
prometheus_target_scrapes_exceeded_sample_limit_total
prometheus_target_skipped_scrapes_total
prometheus_target_sync_length_seconds
prometheus_target_sync_length_seconds_count
prometheus_target_sync_length_seconds_sum
prometheus_treecache_watcher_goroutines
prometheus_treecache_zookeeper_failures_total
None of them look like they directly give me the number I'm looking for.
The closest I've gotten is rate(prometheus_notifications_sent_total[1m])
which seems to give me the number of notifications sent in a 1-minute interval -- which isn't quite what I want, because some notifications fire at different intervals and the data is noisier than I'd like.
I'd like to display on a Grafana dashboard the number of Prometheus notifications that are currently firing. Can I do this with a Prometheus expression? If so, what should the expression look like?
EDIT:
By "firing" I mean the number of alerts listed as active in the alerts dashboard on Prometheus.
E.g., if you open up the dropdown there, you get an entry for each active alert, and the state says "FIRING". I think that's where I got the term "firing".
Upvotes: 1
Views: 3677
Reputation: 1495
To count all alerts that are firing right now:
count(ALERTS{alertstate="firing"})
To see the number of firing instances of a specific alert, THE_NAME_OF_THE_ALERT:
count(ALERTS{alertname="THE_NAME_OF_THE_ALERT",alertstate="firing"})
Another option, if you want to see what's failing even before an alert is triggered (the alert may, for example, be set to fire only after 10 seconds of failure), and your targets expose probe_success as the blackbox exporter does:
count(probe_success == 0)
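One caveat with the count() expressions above: when nothing is firing, count() has no series to aggregate and returns an empty result rather than 0, which a dashboard panel usually shows as "no data". A minimal Python sketch of handling that on the client side, assuming the response follows the instant-query format of the Prometheus HTTP API; the function name and the sample payloads are made up for illustration:

```python
# Hypothetical helper: pull the number out of an instant-query response for
#   count(ALERTS{alertstate="firing"})
# The payload shape is assumed to match the Prometheus HTTP API (/api/v1/query).

def firing_count(payload: dict) -> int:
    """Return the firing-alert count, treating an empty result as 0."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    result = payload["data"]["result"]
    if not result:
        # count() over zero series yields no sample at all, not a 0.
        return 0
    # Each vector sample looks like {"metric": {...}, "value": [<timestamp>, "<value>"]}.
    _timestamp, value = result[0]["value"]
    return int(float(value))


# Made-up sample payloads, not real server output.
three_firing = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [{"metric": {}, "value": [1700000000.0, "3"]}],
    },
}
none_firing = {"status": "success", "data": {"resultType": "vector", "result": []}}

print(firing_count(three_firing))
print(firing_count(none_firing))
```

If you would rather handle this in the expression itself, appending `or vector(0)` to the count, as in `count(ALERTS{alertstate="firing"}) or vector(0)`, should make the panel show 0 instead of an empty result.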
Upvotes: 2
Reputation: 11
Alerts are exposed as a special time series named ALERTS. I'm not familiar with Grafana, so I personally would use the HTTP API to count the number of currently firing alerts, like so:
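# --data-urlencode passes the braces and quotes safely; in a bare URL, curl would treat {...} as a glob
curl -s -G 'http://prometheus-002:9090/api/v1/query' \
--data-urlencode 'query=ALERTS{alertstate="firing"}' \
|grep -o '"__name__":' |wc -l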
Maybe you could also define a recording rule that stores this count as its own meta-metric, and have Grafana graph that.
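For comparison, the curl pipeline above can be sketched in Python, parsing the JSON instead of grepping it. This is only a sketch under assumptions: the helper names are made up, the host is the one from the curl example, and the sample response is invented to match the instant-query format of the HTTP API. The network call is wrapped in a function and not exercised here.

```python
import json
import urllib.parse
import urllib.request
from collections import Counter

QUERY = 'ALERTS{alertstate="firing"}'


def build_query_url(base_url: str, query: str = QUERY) -> str:
    """URL-encode the PromQL expression so braces and quotes survive transport."""
    encoded = urllib.parse.urlencode({"query": query})
    return f"{base_url.rstrip('/')}/api/v1/query?{encoded}"


def count_by_alertname(payload: dict) -> Counter:
    """Count firing series per alertname in an instant-query response."""
    return Counter(
        sample["metric"].get("alertname", "<unknown>")
        for sample in payload["data"]["result"]
    )


def fetch_firing_counts(base_url: str) -> Counter:
    """Query a live server; needs a reachable Prometheus, so it is not run below."""
    with urllib.request.urlopen(build_query_url(base_url), timeout=10) as resp:
        return count_by_alertname(json.load(resp))


# Made-up sample response, not real server output.
sample = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "ALERTS", "alertname": "HighLatency",
                        "alertstate": "firing", "instance": "a"},
             "value": [1700000000.0, "1"]},
            {"metric": {"__name__": "ALERTS", "alertname": "HighLatency",
                        "alertstate": "firing", "instance": "b"},
             "value": [1700000000.0, "1"]},
            {"metric": {"__name__": "ALERTS", "alertname": "DiskFull",
                        "alertstate": "firing", "instance": "a"},
             "value": [1700000000.0, "1"]},
        ],
    },
}

counts = count_by_alertname(sample)
print(sum(counts.values()))   # total firing alerts in the sample
print(counts["HighLatency"])  # firing instances of one alert
```

Calling `fetch_firing_counts("http://prometheus-002:9090")` would do the same against a live server. Grouping by alertname gives both the total and the per-alert numbers from a single query.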
Upvotes: 1