I have a systemd service that runs a SQL script that refreshes a bunch of materialized views. The service runs every 5 minutes. The service is deployed by ansible. If the SQL script fails, I would like to send an email, which notifies us of failure.

The current code (see the yml chunk below) will send the notification email every 5 min or so, until one of us fixes the problem or stops the service. This is too frequent: one email is enough, and is exactly what I need.

How can I send one and only one email, even if the script fails repeatedly?

I am considering using a wrapper script such as this pseudocode, but it looks ugly:

# Runs every 5 min:
psql -f refresh_matviews.sql || touch refresh_matviews.failed.log
if { exists refresh_matviews.failed.log } and { not grep "seen" refresh_matviews.failed.log } then
    echo "failed!" | mail admin@foo.com
    echo "seen" > refresh_matviews.failed.log

If such a wrapper script is used, then whoever fixes the problem needs to also manually clear the (now outdated) failure file (rm refresh_matviews.failed.log), so that any new failure triggers a new email.

The relevant chunk of the yml file for ansible:

- name: Add systemd service that refreshes matviews
  copy:
    content: |
      # This service unit refreshes matviews
      #
      [Unit]
      Description=Refreshes matviews
      Wants=refresh_matviews.timer
      [Service]
      User=galaxy
      Type=oneshot
      ExecStart=/bin/bash -c '/usr/bin/psql ... -f /path/to/refresh_matviews.sql || echo 'WARNING' | /usr/bin/mail -s "not ok: refresh matviews" admin@foo.com'
      [Install]
      WantedBy=multi-user.target
    dest: /etc/systemd/system/refresh_matviews.service
    owner: root
    group: root
    mode: 0644

- name: Add systemd timer that refreshes matviews
  copy:
    content: |
      # This timer unit refreshes matviews
      #
      [Unit]
      Description=Refreshes matviews
      Requires=refresh_matviews.service
      [Timer]
      Unit=refresh_matviews.service
      OnCalendar=*-*-* *:00/5:00
      [Install]
      WantedBy=timers.target
    dest: /etc/systemd/system/refresh_matviews.timer
    owner: root
    group: root
    mode: 0644

It seems that ansible / systemd should have something similar to what I need, but this is all I could find:

Systemd "OnFailure=" not starting when binary or bash exits with an error code
systemd action when a service fails?
Proper way to use OnFailure in systemd

Sending a single notification email even if a scheduled job fails many times

With respect to the given use case description

... until one of us fixes the problem or stops the service.

and the comment which was made, you may consider the SQL script error status as a fact about the system. So you could simply introduce a Custom Fact

add dynamic facts by adding executable scripts to facts.d.

so the next time fact gathering is run, your facts will just include the script error status and you can proceed further with Conditionals based on ansible_facts.

Even if currently

... the service (annot.: only) is deployed by Ansible.

this approach will help to maintain the status via Ansible as well as via a separate script that sends the email alerts.

Regarding

... but could not figure out how exactly I can use custom facts and conditionals based on ansible facts in my specific case.

I've added some more specific information on how to implement and use a Custom Fact, and I am focusing on the Ansible part only.

Use Case and Rapid Prototype

It is assumed that there is an Ansible Tower installation
Which is running in High Availability
The application database backend is a separate dedicated installation
With Streaming Replication implemented
All nodes already integrated in a separate Monitoring Infrastructure
For this example, I am interested only in the status of Streaming Replication on the Database Secondary Node
I would like to implement something in Ansible that restarts the Streaming Replication if it has stopped, for example because of network events
To do so, I need to know whether the Streaming Replication is GOOD / OK or not

How to implement Custom Facts?

First I need to find out the Streaming Replication status on the Secondary Node. The following can be run regularly on the node, as a script or cron job.

psql -c "SELECT pg_is_in_recovery(),pg_is_wal_replay_paused(), pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn(), pg_last_xact_replay_timestamp()" -x -t

Credits to

For facts.d or local facts and in order

To use facts.d, create an /etc/ansible/facts.d directory on the remote host or hosts. ... Add files to the directory to supply your custom facts. All file names must end with .fact. The files can be JSON, INI, or executable files returning JSON.
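As an illustration of the "executable files returning JSON" variant, here is a minimal sketch of an executable fact. The file name, the `STATUS_FILE` location and the `emit_fact` helper are assumptions made for this example and are not part of the setup described in this answer; the only contract taken from the documentation is that the file ends in .fact, is executable, and prints JSON on stdout.

```shell
#!/bin/sh
# Hypothetical executable fact, e.g. /etc/ansible/facts.d/refresh.fact (chmod +x).
# During fact gathering its stdout is parsed as JSON.
# STATUS_FILE is an assumed place where the scheduled job stores its last exit code.
STATUS_FILE="${STATUS_FILE:-/var/tmp/refresh_matviews.rc}"

# Print {"rc":"<last exit code>"}, or "unknown" if nothing has been recorded yet.
emit_fact() {
    if [ -r "$STATUS_FILE" ]; then
        rc=$(cat "$STATUS_FILE")
    else
        rc="unknown"
    fi
    printf '{"rc":"%s"}\n' "$rc"
}

emit_fact
```

Compared with a static fact file, the executable variant reads the status at gathering time, so the scheduled job only needs write access to its own status file and not to the facts.d directory.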

Since I am going to do further processing with Ansible and Python, I prefer to format the result as JSON up front, as this makes later processing easier.

psql -c "SELECT json_agg(t) FROM (SELECT pg_is_in_recovery(),pg_is_wal_replay_paused(), pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn(), pg_last_xact_replay_timestamp()) t" -x -t | cut -d "|" -f 2

The output can be redirected directly into the fact file via

> /etc/ansible/facts.d/streaming.fact

Depending on testing and outcome, possible add-ons are

| tr -d "[:blank:]\n"
# or
| tr -d "[:blank:][]\n" # <- I've used this in my example

or even the return code of the script or cron job, as in this clumsy variant

; echo "{\"rc\":\"${?}\"}" > /etc/ansible/facts.d/script.fact

or something via

./script.sh; jq --null-input --monochrome-output --arg rc "$?" '$ARGS.named' > /etc/ansible/facts.d/script.fact

Credits to

That is the implementation so far. It produces two files on the Secondary Node.

~/test$ tree /etc/ansible/facts.d/
/etc/ansible/facts.d/
├── script.fact
└── streaming.fact

~/test$ cat /etc/ansible/facts.d/script.fact
{"rc":"0"}

~/test$ cat /etc/ansible/facts.d/streaming.fact
{"pg_is_in_recovery":true,"pg_is_wal_replay_paused":false,"pg_last_wal_receive_lsn":"1/AB2345CD","pg_last_wal_replay_lsn":"1/AB2345CD","pg_last_xact_replay_timestamp":"2023-02-01T09:00:00.00000+01:00"}
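To make the effect of the tr add-on visible without a database, the following sketch feeds it a hand-written string. The `raw` value is a hypothetical stand-in for what psql and cut might return, not captured output, and `strip_wrapping` is just a name chosen here for the tr call.

```shell
#!/bin/sh
# Same tr call as the add-on above: delete blanks, square brackets and newlines.
strip_wrapping() {
    tr -d '[:blank:][]\n'
}

# Hypothetical stand-in for the psql | cut output (leading blank, array brackets, newline).
raw=' [{"pg_is_in_recovery":true, "pg_is_wal_replay_paused":false}]
'

printf '%s' "$raw" | strip_wrapping
```

Note that this is a blunt filter: it also removes blanks inside string values and the brackets of any nested arrays, so it is only safe for flat results like the one in this example.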

How to use the Custom Fact?

A minimal example playbook

---
- hosts: localhost
  become: false
  gather_facts: true

  tasks:

  - name: Show Facts
    debug:
      msg: "{{ ansible_facts.ansible_local }}"

will result in an output of

TASK [Show Facts] ****************************************************
ok: [localhost] =>
  msg:
    script:
      rc: '0'
    streaming:
      pg_is_in_recovery: true
      pg_is_wal_replay_paused: false
      pg_last_wal_receive_lsn: 1/AB2345CD
      pg_last_wal_replay_lsn: 1/AB2345CD
      pg_last_xact_replay_timestamp: '2023-02-01T09:00:00.00000+01:00'

or in case of a failure of fact file generation

TASK [Gathering Facts] **********************************************************************************
[WARNING]: error loading facts as JSON or ini - please check content: /etc/ansible/facts.d/script.fact
ok: [localhost]

TASK [Show Facts] ***************************************************************************************
ok: [localhost] =>
  msg:
    script: 'error loading facts as JSON or ini - please check content: /etc/ansible/facts.d/script.fact'

A Conditional based on (even custom) ansible_facts could then look like

  - name: Show Facts
    debug:
      msg: "{{ ansible_facts.ansible_local.streaming }}"
    when: ansible_facts.ansible_local.script.rc | int == 0 # run only if there was no failure, rc=0

In case of a failed script run producing exit code 1 and a fact file content of rc: 1, the task would simply be skipped.
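Coming back to the original question of sending only one email: the same fact file can serve as the memory of the previous run, so that an alert is raised only when the status changes from OK to failed. The following is a rough sketch of that idea and not a tested production script. `record_rc` and `notify` are hypothetical helpers, `notify` only echoes instead of calling mail, and it is assumed that the user running the job may write to `FACT_FILE`.

```shell
#!/bin/sh
# Sketch: keep the last exit code in the fact file and alert only on the
# transition from OK to failed. FACT_FILE and notify() are assumptions.
FACT_FILE="${FACT_FILE:-/etc/ansible/facts.d/script.fact}"

# Placeholder alert; replace the echo with the real mail command.
notify() {
    echo "ALERT: $1"
}

# record_rc RC: write {"rc":"RC"} to FACT_FILE and call notify only if the
# previous run was OK (or no status exists yet) and this run failed.
record_rc() {
    new_rc=$1
    prev_rc=0
    if [ -r "$FACT_FILE" ]; then
        # Extract the previous code from the compact {"rc":"N"} form written below.
        prev_rc=$(sed -n 's/.*"rc":"\([0-9]*\)".*/\1/p' "$FACT_FILE")
        [ -n "$prev_rc" ] || prev_rc=0
    fi
    printf '{"rc":"%s"}\n' "$new_rc" > "$FACT_FILE"
    if [ "$new_rc" -ne 0 ] && [ "$prev_rc" -eq 0 ]; then
        notify "refresh failed with exit code $new_rc"
    fi
}
```

It would be called right after the job, for example as `record_rc "$?"` directly after the psql command. Because a successful run writes rc 0 again, nobody has to clear a stale failure marker by hand, and the next failure after a fix triggers a new alert. The sed expression only understands the compact one-line form that `record_rc` itself writes.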

Upvotes: 1