Reputation: 17392
We're using Riemann and riemann-health to monitor our servers. However, I now get quite a lot of critical CPU warnings because the CPU peaked for a very short time, which I don't think I even need to know about. As I understand it, constantly high CPU usage will raise the load average, which is reported as well and sounds far more useful.
I don't want to disable CPU reporting; I just want every level to be treated as ok. If possible, I'd like to change the events on the Riemann server so that I don't have to change all the servers.
Here is our Riemann config: https://gist.github.com/iGEL/e352764a8c559440c851
Upvotes: 6
Views: 233
Reputation: 4470
I don't have a full solution, but in theory you should be able to filter your CPU-related events with the where function and set the state unconditionally to "ok" using with, as follows:
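(streams
  ;; Match every event whose service name contains "cpu".
  (where (service #"cpu")
    ;; Force the state to "ok" before indexing; index must already be bound in the config.
    (with :state "ok" index)))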
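One caveat with the snippet above: if the config already sends every event to the index elsewhere, CPU events would be indexed twice, once with their original state and once as "ok". A minimal sketch that avoids this, assuming the (else ...) branch supported by where and an index bound with let (the binding and overall layout are assumptions, not taken from the linked gist):

```clojure
(let [index (index)]
  (streams
    (where (service #"cpu")
      ;; CPU events: override the state, then index.
      (with :state "ok" index)
      ;; Everything else: index unchanged.
      (else index))))
```

With this shape, each event reaches the index exactly once, so the "ok" override cannot be overwritten by an unmodified copy of the same event.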
On the other hand, relying on the load average is not a good idea since a high load average can also mean that a large number of processes are waiting for IO.
Instead of silencing CPU alerts, you could alert only if CPU has been in a state other than ok for more than X time units. Even better, alert on a higher-level metric that represents a client-impacting issue, such as response latency, HTTP status codes, or error levels. After all, if CPU is high but there's no impact on the system, an alert is likely just noise.
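The "not ok for more than X time units" idea could be sketched with Riemann's stable stream, which only passes events on once a chosen field has kept the same value for a given number of seconds. This is an untested sketch: the 300-second window is an arbitrary assumption, and prn is only a placeholder for whatever notifier you actually use.

```clojure
(streams
  (where (service #"cpu")
    ;; Track each host/service pair separately.
    (by [:host :service]
      ;; Pass events on only once :state has stayed the same for 300 seconds.
      (stable 300 :state
        (where (not (state "ok"))
          ;; Placeholder: replace prn with a real notifier, e.g. an email stream.
          prn)))))
```

Short CPU peaks flip the state back to ok before the window elapses, so they never reach the notifier. Once a non-ok state has persisted, every further event in that state is passed on, so in practice you would probably also want to limit how often the notifier fires.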
Upvotes: 0