Reputation: 1428
I am trying to optimize my monitoring system. We are using Geneos ITRS.
I have a sampler with two endpoints (let's call them port100 and port101) and tried to create a rule that sends a single alert if either of these ports fails.
The rule is configured to send an email with a success or failure message whenever the status changes.
The sampler is configured to run every 60 seconds.
For some time it worked: if I stopped either service, an email was generated. If I stopped both services, two emails were generated. If I ran the sampler manually, an email was generated.
But at some point I did something wrong and got an infinite loop, receiving a few thousand emails per minute until I restarted the ITRS gateway and disabled this rule.
Can anybody explain why this happened? I believed that a rule block should be triggered only when a sample is executed, and there are no loop commands in the ITRS rule block syntax, so I am not sure I understand how a "rule block" is connected with samplers.
The code example (!!! PLEASE DO NOT RUN IT IN PRODUCTION, IT COULD HARM YOUR GATEWAY !!!):
set $(myStatus) "OK"
if path "port100" value <> "OK" then
    set $(myStatus) value
endif
if path "port101" value <> "OK" then
    set $(myStatus) value
endif
if $(myStatus) <> "OK" and severity = ok then
    severity warning
    userdata "Subject" "something is wrong"
    run "SendEmail"
elseif previous severity <> ok then
    severity ok
    userdata "Subject" "Everything is ok"
    run "SendEmail"
endif
I see a few non-critical things in the script that could be fixed (for example, using set $(myStatus) "NotOk", and there is no need to compare with the previous severity status), but I prefer to show the original "bad script" to provide all the evidence.
Kindly help me with the following:
1. Why did I get a loop of email alerts?
2. Why was it thousands of emails per minute instead of one or two per minute? (A sampler with two endpoints and a 60-second interval produces only two sample executions per minute.)
3. (Minor question) How can I monitor several endpoints and generate one single alert if one or several of them are unavailable?
Thank you in advance.
P.S. If I understand correctly that a rule block should be triggered only by sample execution, could this be a bug in ITRS?
Upvotes: 0
Views: 807
Reputation: 1428
It seems I have found the answer to questions 1 and 2 in the official documentation (https://docs.itrsgroup.com/docs/geneos/5.9.0/Gateway_Reference_Guide/geneos_rulesactionsalerts_tr.html):
Note: It is important to understand that when part of a rule is triggered and fires, resets an action or changes some attribute, the rule will be re-evaluated. This is of particular significance when using the previous keyword, as it will only access the previous value of the attribute whose change triggered the rule evaluation. For any other attribute, previous will access the current value. Using the previous keyword for attributes in a rule which are changed by the rule itself may cause duplicate actions as the rule will be re-evaluated multiple times.
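So because my rule both checked the previous severity and changed the severity itself, every change caused the rule to be re-evaluated, and I got an infinite loop. This also explains the email volume: the re-evaluations are triggered by the rule's own changes, not by the 60-second sampling interval.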
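To illustrate the feedback loop, here is a small Python toy model. It is not the Geneos rule engine and does not use Geneos syntax. It rests on two simplifying assumptions of mine: the engine re-evaluates a rule every time the rule changes severity, and `previous severity` is treated as the severity seen at the start of the evaluation. The names `looping_rule`, `stable_rule`, `aggregate`, `evaluate` and `MAX_EVALS` are hypothetical. The second rule also sketches one possible approach to question 3: combine all endpoints into one status and send an email only on a real transition.

```python
# Toy model only: this is NOT the Geneos rule engine.
# Assumptions (hypothetical): the engine re-evaluates a rule whenever the rule
# changes severity, and "previous severity" is simplified to the severity seen
# at the start of the current evaluation.

MAX_EVALS = 1000  # safety cap so the looping rule terminates in this demo


def aggregate(ports):
    """Return "OK" only if every endpoint reports "OK"."""
    return "OK" if all(value == "OK" for value in ports.values()) else "NotOk"


def looping_rule(state, ports):
    """Same shape as the rule in the question: reads severity and writes it."""
    status = aggregate(ports)
    if status != "OK" and state["severity"] == "ok":
        state["severity"] = "warning"
        return 1  # "something is wrong" email
    elif state["severity"] != "ok":
        state["severity"] = "ok"
        return 1  # "Everything is ok" email
    return 0


def stable_rule(state, ports):
    """Severity depends only on endpoint values, so re-evaluation changes nothing."""
    wanted = "ok" if aggregate(ports) == "OK" else "warning"
    if state["severity"] != wanted:
        state["severity"] = wanted
        return 1  # one email per real transition
    return 0


def evaluate(rule, ports):
    """Re-run the rule until severity stops changing; return (evaluations, emails)."""
    state = {"severity": "ok"}
    emails = 0
    for evals in range(1, MAX_EVALS + 1):
        before = state["severity"]
        emails += rule(state, ports)
        if state["severity"] == before:
            return evals, emails
    return MAX_EVALS, emails  # never settled within the cap


ports = {"port100": "OK", "port101": "FAILED"}
looping = evaluate(looping_rule, ports)
stable = evaluate(stable_rule, ports)
print("looping rule (evaluations, emails):", looping)
print("stable rule (evaluations, emails):", stable)
```

In this model the first rule flips severity between ok and warning on every evaluation and only stops at the safety cap, while the second settles after one change. Whether the real gateway behaves exactly like this depends on details the documentation note does not spell out, so treat it as an intuition aid rather than a reproduction.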
Upvotes: 1