Timur Shtatland

Reputation: 12395

Sending a single notification email even if a scheduled job fails many times

I have a systemd service that runs a SQL script that refreshes a bunch of materialized views. The service runs every 5 minutes. The service is deployed by ansible. If the SQL script fails, I would like to send an email, which notifies us of failure. The current code (see the yml chunk below) will send the notification email every 5 min or so, until one of us fixes the problem or stops the service. This is too frequent: one email is enough, and is exactly what I need.

How can I send one and only one email, even if the script fails repeatedly?

I am considering using a wrapper script such as this pseudocode, but it looks ugly:

# Runs every 5 min:
psql -f refresh_matviews.sql || touch refresh_matviews.failed.log
if { exists refresh_matviews.failed.log }
   and { not grep "seen" refresh_matviews.failed.log } then
     echo "failed!" | mail [email protected]
     echo "seen" > refresh_matviews.failed.log

If such a wrapper script is used, then whoever fixes the problem needs to also manually clear the (now outdated) failure file (rm refresh_matviews.failed.log), so that any new failure triggers a new email.
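The wrapper idea above can be sketched as runnable bash. This is only a sketch under stated assumptions: `run_and_notify_once`, `send_failure_mail`, `MARKER`, `NOTIFY_CMD` and `MAIL_TO` are hypothetical names, and the marker path is a placeholder. Unlike the pseudocode, it removes the marker file after a successful run, so a later failure would trigger a new email without manual cleanup.

```shell
#!/usr/bin/env bash
# Sketch: send one notification per failure streak, using a marker file as state.
# All names below are hypothetical and chosen for illustration only.

MARKER="${MARKER:-/tmp/refresh_matviews.failed}"   # assumed state-file location
NOTIFY_CMD="${NOTIFY_CMD:-send_failure_mail}"      # assumed notifier (function or command)

# Hypothetical notifier; MAIL_TO must be set to the recipient address.
send_failure_mail() {
    echo "refresh matviews failed" | mail -s "not ok: refresh matviews" "$MAIL_TO"
}

# Usage: run_and_notify_once COMMAND [ARGS...]
run_and_notify_once() {
    local rc=0
    "$@" || rc=$?

    if [ "$rc" -eq 0 ]; then
        # Success ends the failure streak, so the next failure notifies again.
        rm -f -- "$MARKER"
        return 0
    fi

    if [ ! -e "$MARKER" ]; then
        # First failure of this streak: notify, and create the marker only
        # if the notifier itself succeeded (otherwise retry on the next run).
        "$NOTIFY_CMD" && : > "$MARKER"
    fi
    return "$rc"
}

# Example call (psql options elided as in the unit file):
# run_and_notify_once /usr/bin/psql ... -f /path/to/refresh_matviews.sql
```

Returning the original exit code keeps the failure visible to systemd, while the marker file only controls whether an email is sent.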

The relevant chunk of the yml file for ansible:

  - name: Add systemd service that refreshes matviews
    copy:
      content: |
          # This service unit refreshes matviews
          #
          [Unit]
          Description=Refreshes matviews
          Wants=refresh_matviews.timer
                
          [Service]
          User=galaxy
          Type=oneshot
          ExecStart=/bin/bash -c '/usr/bin/psql ... -f /path/to/refresh_matviews.sql || echo 'WARNING' | /usr/bin/mail -s "not ok: refresh matviews" [email protected]'
          [Install]
          WantedBy=multi-user.target
      dest: /etc/systemd/system/refresh_matviews.service
      owner: root
      group: root
      mode: 0644
  - name: Add systemd timer that refreshes matviews
    copy:
      content: |
        # This timer unit refreshes matviews
        #
        
        [Unit]
        Description=Refreshes matviews
        Requires=refresh_matviews.service
              
        [Timer]
        Unit=refresh_matviews.service
        OnCalendar=*-*-* *:00/5:00
                  
        [Install]
        WantedBy=timers.target
      dest: /etc/systemd/system/refresh_matviews.timer
      owner: root
      group: root
      mode: 0644

It seems that ansible/systemd should have something similar to what I need, but this is all I could find:

Upvotes: 5

Views: 750

Answers (2)

U880D

Reputation: 12120

In respect to the given use case description

... until one of us fixes the problem or stops the service.

and the comment which was made, you may consider the SQL script error status as a fact about the system. So you could simply introduce a Custom Fact

add dynamic facts by adding executable scripts to facts.d.

so the next time fact gathering is run, your facts will just include the script error status and you can proceed further with Conditionals based on ansible_facts.

Even if currently

... the service (annot.: only) is deployed by Ansible.

this approach will help to maintain the status via Ansible as well as via a separate script which sends the email alerts.


Regarding

... but could not figure out how exactly I can use custom facts and conditionals based on ansible facts in my specific case.

I've added some more specific information on how to implement and use a Custom Fact, focusing on the Ansible part only.

Use Case and Rapid Prototype

  • It is assumed that there is an Ansible Tower installation
  • Which is running in High Availability
  • The application database backend is a separate dedicated installation
  • With Streaming Replication implemented
  • All nodes already integrated in a separate Monitoring Infrastructure
  • For this example, only the status of Streaming Replication on the Database Secondary Node is of interest
  • I would like to implement something in Ansible which restarts the Streaming Replication if it has stopped, for example because of network events
  • To do so, I need to know whether the Streaming Replication is GOOD / OK or not

How to implement Custom Facts?

First I need to find out the Streaming Replication status on the Secondary Node. This can be run periodically on the node as a script or cron job.

psql -c "SELECT pg_is_in_recovery(),pg_is_wal_replay_paused(), pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn(), pg_last_xact_replay_timestamp()" -x -t


For facts.d, or local facts, note that

To use facts.d, create an /etc/ansible/facts.d directory on the remote host or hosts. ... Add files to the directory to supply your custom facts. All file names must end with .fact. The files can be JSON, INI, or executable files returning JSON.

Since I am going to do further processing with Ansible and Python, I format the result as JSON beforehand (pre-processing), as this makes later processing easier.

psql -c "SELECT json_agg(t) FROM (SELECT pg_is_in_recovery(),pg_is_wal_replay_paused(), pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn(), pg_last_xact_replay_timestamp()) t" -x -t | cut -d "|" -f 2

The output can be written directly into the fact file via

> /etc/ansible/facts.d/streaming.fact

Depending on testing and outcome, add-ons could be

| tr -d "[:blank:]\n"
# or
| tr -d "[:blank:][]\n" # <- I've used this in my example

or even the return code of the script or cron job, as in this clumsy example

; echo "{\"rc\":\"${?}\"}" > /etc/ansible/facts.d/script.fact

or something via

./script.sh; jq --null-input --monochrome-output --arg rc "$?" '$ARGS.named' > /etc/ansible/facts.d/script.fact


That completes the implementation, which produces two files on the Secondary Node.

~/test$ tree /etc/ansible/facts.d/
/etc/ansible/facts.d/
├── script.fact
└── streaming.fact

~/test$ cat /etc/ansible/facts.d/script.fact
{"rc":"0"}

~/test$ cat /etc/ansible/facts.d/streaming.fact
{"pg_is_in_recovery":true,"pg_is_wal_replay_paused":false,"pg_last_wal_receive_lsn":"1/AB2345CD","pg_last_wal_replay_lsn":"1/AB2345CD","pg_last_xact_replay_timestamp":"2023-02-01T09:00:00.00000+01:00"}
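A file like script.fact above could be produced by a small wrapper around the script. The following is a minimal sketch, not a definitive implementation: `write_rc_fact` and `FACT_FILE` are hypothetical names. It writes to a temporary file and renames it, on the assumption that this avoids fact gathering reading a half-written file.

```shell
#!/usr/bin/env bash
# Sketch: record a command's exit code as an Ansible local fact.
# write_rc_fact and FACT_FILE are hypothetical names for illustration.

FACT_FILE="${FACT_FILE:-/etc/ansible/facts.d/script.fact}"

# Usage: write_rc_fact COMMAND [ARGS...]
write_rc_fact() {
    local rc=0 tmp
    "$@" || rc=$?

    # Write to a temp file next to the target, then rename it into place,
    # so a concurrent reader sees either the old or the new content.
    tmp="$(mktemp "${FACT_FILE}.XXXXXX")" || return 1
    printf '{"rc":"%s"}\n' "$rc" > "$tmp"

    # Readable but not executable: executable .fact files are run by Ansible.
    chmod 0644 "$tmp"
    mv -f -- "$tmp" "$FACT_FILE"
    return "$rc"
}

# Example call:
# write_rc_fact ./script.sh
```

The rc value is stored as a string, matching the `{"rc":"0"}` format shown above.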

How to use the Custom Fact?

A minimal example playbook

---
- hosts: localhost
  become: false
  gather_facts: true

  tasks:

  - name: Show Facts
    debug:
      msg: "{{ ansible_facts.ansible_local }}"

results in an output of

TASK [Show Facts] ****************************************************
ok: [localhost] =>
  msg:
    script:
      rc: '0'
    streaming:
      pg_is_in_recovery: true
      pg_is_wal_replay_paused: false
      pg_last_wal_receive_lsn: 1/AB2345CD
      pg_last_wal_replay_lsn: 1/AB2345CD
      pg_last_xact_replay_timestamp: '2023-02-01T09:00:00.00000+01:00'

or in case of a failure of fact file generation

TASK [Gathering Facts] **********************************************************************************
[WARNING]: error loading facts as JSON or ini - please check content: /etc/ansible/facts.d/script.fact
ok: [localhost]

TASK [Show Facts] ***************************************************************************************
ok: [localhost] =>
  msg:
    script: 'error loading facts as JSON or ini - please check content: /etc/ansible/facts.d/script.fact'

A Conditional based on (even custom) ansible_facts could then look like

  - name: Show Facts
    debug:
      msg: "{{ ansible_facts.ansible_local.streaming }}"
    when: not ansible_facts.ansible_local.script.rc | bool # if there was no failure, rc=0

In case of a failed script run producing exit code 1 and a fact file content of rc: 1, the task would simply be skipped.

Upvotes: 1

Zeitounator

Reputation: 44760

Enable a persistent fact cache backend in your project ansible.cfg. See this entrypoint in documentation. For the example I use a simple json file cache:

[defaults]
fact_caching=jsonfile
fact_caching_connection=/tmp/ansible_cache

Once the cache is enabled, the idea is to:

  1. Get the status of the possibly failing task
  2. Send an email if the task hasn't failed before
  3. Register the status of the task

Here is some pseudo code to give you the idea:

- name: My maybe failing tasks
  command: /bin/false
  ignore_errors: true
  register: my_cmd

- name: Send an email if relevant
  mail:
    # Your mail task options
  when:
    - previous_run_ok | d(true) | bool
    - my_cmd is failed

- name: Register result of this run for later (cached fact)
  set_fact:
    previous_run_ok: "{{ my_cmd is success }}"
    cacheable: true  # needed so the value is written to the fact cache

Upvotes: 3
