Thiago Vinicius

Reputation: 1

Puppet: Recovering from failed run

Consider the following code:

file { '/etc/systemd/system/docker.service.d/http-proxy.conf':
  ensure  => 'present',
  owner   => 'root',
  group   => 'root',
  mode    => '644',
  content => '[Service]
Environment="HTTP_PROXY=http://10.0.2.2:3128"
Environment="HTTPS_PROXY=http://10.0.2.2:3128"
',
  notify  => Exec['daemon-reload'],
  require => Package['docker-ce'],
}

exec { 'daemon-reload':
  command     => 'systemctl daemon-reload',
  path        => '/sbin',
  refreshonly => true,
}

service { 'docker':
  ensure    => 'running',
  subscribe => File['/etc/systemd/system/docker.service.d/http-proxy.conf'],
  require   => Exec['daemon-reload'],
}

I would like to change the configuration of a systemd service. In this instance it is the environment for Docker, but it could be any other change.

Since a systemd unit file has been changed, systemctl daemon-reload must be run for the new configuration to be picked up.

Running puppet apply fails:

Notice: Compiled catalog for puppet-docker-test.<redacted> in environment production in 0.18 seconds
Notice: /Stage[main]/Main/File[/etc/systemd/system/docker.service.d/http-proxy.conf]/ensure: defined content as '{md5}dace796a9904d2c5e2c438e6faba2332'
Error: /Stage[main]/Main/Exec[daemon-reload]: Failed to call refresh: Could not find command 'systemctl'
Error: /Stage[main]/Main/Exec[daemon-reload]: Could not find command 'systemctl'
Notice: /Stage[main]/Main/Service[docker]: Dependency Exec[daemon-reload] has failures: false
Warning: /Stage[main]/Main/Service[docker]: Skipping because of failed dependencies
Notice: Applied catalog in 0.15 seconds

The cause is immediately obvious: systemctl lives in /bin, not in /sbin as configured. However, after fixing this, running puppet apply again neither runs systemctl daemon-reload nor restarts the service:

Notice: Compiled catalog for puppet-docker-test.<redacted> in environment production in 0.19 seconds
Notice: Applied catalog in 0.16 seconds

Apparently, this happens because the file resource has no changes on the second run (it was already applied during the failed run), so nothing refreshes daemon-reload and nothing triggers the service restart.
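
For reference, the path fix described above might look like the following sketch. Only /bin comes from the question; listing /usr/bin as well is an assumption, added to cover systems where systemctl is installed there instead.

```puppet
# Corrected exec from the question: the search path now includes /bin,
# where systemctl was found on the test machine.
# '/usr/bin' is an assumption for distributions that install it there.
exec { 'daemon-reload':
  command     => 'systemctl daemon-reload',
  path        => ['/bin', '/usr/bin'],
  refreshonly => true,
}
```

With refreshonly => true still set, this exec only runs when it receives a refresh event, which is why correcting the path alone does not recover the missed reload.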

To force Puppet to run the daemon-reload and restart the service, I could change the contents of the file on disk or change the contents in the Puppet code, but it feels like I'm missing a better way of doing this.

How can I better recover from such a scenario? Or, how can I write Puppet code that doesn't have this issue?

Upvotes: 0

Views: 798

Answers (2)

Manoj K Samtani

Reputation: 1

Your http-proxy.conf was already updated on the first Puppet run, so it will not be updated again on the next run, and therefore no notify will be sent to the systemctl daemon-reload exec.

If you need systemctl daemon-reload to run on every Puppet run, whether or not a notify arrives, then you need to remove refreshonly => true from the exec 'daemon-reload'.
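
A minimal sketch of an exec that runs daemon-reload on every run, whether or not a notify arrives, is shown below. It is meant as a replacement for the exec in the question, not an addition to it, and the path value assumes systemctl is in /bin or /usr/bin.

```puppet
# Replacement for the question's exec, with refreshonly dropped.
# Without refreshonly the command runs on every Puppet run,
# so every run will report this resource as changed.
# The path entries are an assumption about where systemctl is installed.
exec { 'daemon-reload':
  command => 'systemctl daemon-reload',
  path    => ['/bin', '/usr/bin'],
}
```

Note that this only covers the reload. In the question's manifest the service subscribes to the file and merely requires the exec, so an unchanged file still would not cause Docker to be restarted.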

Upvotes: 0

John Bollinger

Reputation: 181159

Puppet does not provide a mechanism for resuming a failed run. Doing so would not make much sense to me: for a resumed run to produce a different result, the machine state would have to have changed since the failure, and a machine-state change outside Puppet potentially invalidates the catalog that was being applied.

The agent does, by default, send run reports to the master, so in the event of a failed run, you should be able to determine from that what went wrong. Supposing that you don't want to scour reports to figure out how to recover from a failed run, however, you could consider compiling a recovery script.

For example, you know that any failure may have caused a daemon-reload to be missed, and it's harmless to perform one when it isn't required, so just put that in your script. You might also put in a restart of each service under management. Basically, you're looking for anything that has non-trivial refresh behavior (Execs and Services are the main ones that come to my mind).

It occurs to me that if you're extra clever then it might be possible to put that in the form of one or more Puppet classes, and to determine on the master whether the last run for the target node failed, so as to apply the recovery.
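
As a rough illustration of the recovery-class idea, the sketch below performs an unconditional daemon-reload followed by a Docker restart when a flag is set. The class name docker_recovery and the $recover parameter are hypothetical, and how the flag gets set for a node whose last run failed (Hiera, an external node classifier, or something else) is deliberately left open.

```puppet
# Hypothetical recovery class; the name and the $recover flag are
# illustrative only and do not come from any existing module.
class docker_recovery (
  Boolean $recover = false,
) {
  if $recover {
    # Harmless when not strictly needed, so run it unconditionally.
    exec { 'recovery-daemon-reload':
      command => 'systemctl daemon-reload',
      path    => ['/bin', '/usr/bin'],
    }

    # Restart the managed service only after the reload has succeeded.
    exec { 'recovery-restart-docker':
      command => 'systemctl restart docker',
      path    => ['/bin', '/usr/bin'],
      require => Exec['recovery-daemon-reload'],
    }
  }
}
```

Declaring the class with recover => true for a single run would apply both steps. With the default of false it compiles to nothing, so it can stay in the node's classification permanently.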

As for avoiding the problem in the first place, I can only suggest testing, testing, and more testing. To that end, if you don't have dedicated machines on which to test Puppet updates, then at least select a small number of normal machines that get updates first, so that any problems that arise are closely contained.

Upvotes: 0
