How to clean up NRPE check when removing subordinate charm?

Hey folks, I am writing a small subordinate charm that adds an extra NRPE check alongside the regular NRPE charm, but I can't figure out how to clean up the check when I remove my charm. Specifically, I don't know when or where to call remove_check from the charmhelpers.contrib.charmsupport.nrpe module while the charm is tearing down or removing its relations.

The check gets cleaned up if I run remove_check when the nrpe-external-master relation is available, but when I try to call the same function when the stop hook is running or when the nrpe-external-master relation is no longer available, the configuration on the Nagios unit never gets updated (and therefore the check does not get cleaned up).

I have a minimal working example charm called extra-nrpe-example which has a bundle to demonstrate the issue. Here are the steps:

# Build charm
make build

# Deploy example bundle with Nagios
make deploy-bundle

# Calling remove_check while nrpe-external-master.available is set works and cleans up on Nagios
juju config extra-nrpe-example remove-check=true

# Add check again
juju config extra-nrpe-example --reset remove-check

# Removing the application does not work: the check is not cleaned up on the Nagios unit
juju remove-application extra-nrpe-example

I am probably missing something obvious on the relation or logic side, but any help or input would be appreciated.

Any thoughts @stub? Pinging you as you’ve been involved with the NRPE charm.

Looking at interface:nrpe-external-master, I think the only supported way is to do it in a handler decorated with @when_not('nrpe-external-master.available'). I don’t know if or when Juju calls stop hooks, but removing the subordinate should first remove the nrpe-external-master relation, calling the nrpe-external-master-relation-departed and -relation-broken hooks. This should cause the nrpe-external-master.available flag to be cleared.
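A minimal sketch of that handler shape, assuming the charm uses charms.reactive together with the NRPE class from charmhelpers.contrib.charmsupport.nrpe. The check.configured flag and the lost_nrpe handler name come from the debug log in this thread; the shortname "extra_check" and the helper remove_extra_check are hypothetical. The NRPE object is passed in as a parameter so the snippet runs outside a hook environment, and the decorator wiring is shown only in comments.

```python
def remove_extra_check(nrpe_setup, shortname="extra_check"):
    """Remove one check locally and rewrite the NRPE configuration.

    nrpe_setup is expected to behave like
    charmhelpers.contrib.charmsupport.nrpe.NRPE, i.e. to offer
    remove_check(shortname=...) and write().
    """
    nrpe_setup.remove_check(shortname=shortname)
    nrpe_setup.write()


# Hypothetical wiring inside reactive/extra_nrpe_example.py:
#
#   from charms.reactive import when, when_not, clear_flag
#   from charmhelpers.contrib.charmsupport import nrpe
#
#   @when('check.configured')
#   @when_not('nrpe-external-master.available')
#   def lost_nrpe():
#       remove_extra_check(nrpe.NRPE(primary=False))
#       clear_flag('check.configured')
```

Note that this only covers the local side: whether the removal reaches the Nagios unit depends on the relation still being usable when the handler runs.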

Hey @stub, thanks for looking into this.

I tried to handle this with @when_not('nrpe-external-master.available') here, but the issue seems to be that this is too late, as the remove_check function actually requires an available nrpe-external-master relation for the change to reach the Nagios charm. My example charm cleans up the check locally, but the other side never gets the message because the relation is already gone.

Here is a snippet of the juju debug-log output showing the order of execution when I run juju remove-relation extra-nrpe-example nrpe. As you can see, the relation gets broken before I get a chance to call lost_nrpe().

Any tips on reworking my logic or how to call the remove_check function while the relation is departing but not yet departed would be much appreciated. Though I assume the latter option requires a change in the nrpe-external-master interface?

Snippet:

unit-extra-nrpe-example-0: 07:40:39 INFO unit.extra-nrpe-example/0.juju-log nrpe-external-master:3: Reactive main running for hook nrpe-external-master-relation-departed
...
tracer: ++   queue handler hooks/relations/nrpe-external-master/provides.py:18:broken_nrpe
unit-extra-nrpe-example-0: 07:40:40 INFO unit.extra-nrpe-example/0.juju-log nrpe-external-master:3: Invoking reactive handler: hooks/relations/nrpe-external-master/provides.py:18:broken_nrpe
unit-extra-nrpe-example-0: 07:40:40 DEBUG unit.extra-nrpe-example/0.juju-log nrpe-external-master:3: tracer: cleared flag nrpe-external-master.available
...
tracer: ++   queue handler reactive/extra_nrpe_example.py:44:missing_nrpe
tracer: ++   queue handler reactive/extra_nrpe_example.py:58:lost_nrpe
unit-extra-nrpe-example-0: 07:40:40 INFO unit.extra-nrpe-example/0.juju-log nrpe-external-master:3: Invoking reactive handler: reactive/extra_nrpe_example.py:44:missing_nrpe
unit-extra-nrpe-example-0: 07:40:40 INFO unit.extra-nrpe-example/0.juju-log nrpe-external-master:3: Invoking reactive handler: reactive/extra_nrpe_example.py:58:lost_nrpe

Full log:

unit-extra-nrpe-example-0: 07:39:54 INFO unit.extra-nrpe-example/0.juju-log Reactive main running for hook update-status
unit-extra-nrpe-example-0: 07:39:55 ERROR unit.extra-nrpe-example/0.juju-log Unable to find implementation for relation: requires of juju-info
unit-extra-nrpe-example-0: 07:39:55 DEBUG unit.extra-nrpe-example/0.juju-log tracer>
tracer: starting handler dispatch, 7 flags set
tracer: set flag check.configured
tracer: set flag check.version
tracer: set flag config.default.nagios_context
tracer: set flag config.default.nagios_servicegroups
tracer: set flag config.default.remove-check
tracer: set flag config.set.nagios_context
tracer: set flag nrpe-external-master.available
unit-extra-nrpe-example-0: 07:39:55 DEBUG unit.extra-nrpe-example/0.juju-log tracer: hooks phase, 0 handlers queued
unit-extra-nrpe-example-0: 07:39:55 DEBUG unit.extra-nrpe-example/0.juju-log tracer: main dispatch loop, 0 handlers queued
unit-extra-nrpe-example-0: 07:40:39 INFO unit.extra-nrpe-example/0.juju-log nrpe-external-master:3: Reactive main running for hook nrpe-external-master-relation-departed
unit-extra-nrpe-example-0: 07:40:39 ERROR unit.extra-nrpe-example/0.juju-log nrpe-external-master:3: Unable to find implementation for relation: requires of juju-info
unit-extra-nrpe-example-0: 07:40:39 DEBUG unit.extra-nrpe-example/0.juju-log nrpe-external-master:3: tracer>
tracer: starting handler dispatch, 7 flags set
tracer: set flag check.configured
tracer: set flag check.version
tracer: set flag config.default.nagios_context
tracer: set flag config.default.nagios_servicegroups
tracer: set flag config.default.remove-check
tracer: set flag config.set.nagios_context
tracer: set flag nrpe-external-master.available
unit-extra-nrpe-example-0: 07:40:39 DEBUG unit.extra-nrpe-example/0.juju-log nrpe-external-master:3: tracer>
tracer: hooks phase, 1 handlers queued
tracer: ++   queue handler hooks/relations/nrpe-external-master/provides.py:18:broken_nrpe
unit-extra-nrpe-example-0: 07:40:40 INFO unit.extra-nrpe-example/0.juju-log nrpe-external-master:3: Invoking reactive handler: hooks/relations/nrpe-external-master/provides.py:18:broken_nrpe
unit-extra-nrpe-example-0: 07:40:40 DEBUG unit.extra-nrpe-example/0.juju-log nrpe-external-master:3: tracer: cleared flag nrpe-external-master.available
unit-extra-nrpe-example-0: 07:40:40 DEBUG unit.extra-nrpe-example/0.juju-log nrpe-external-master:3: tracer>
tracer: main dispatch loop, 2 handlers queued
tracer: ++   queue handler reactive/extra_nrpe_example.py:44:missing_nrpe
tracer: ++   queue handler reactive/extra_nrpe_example.py:58:lost_nrpe
unit-extra-nrpe-example-0: 07:40:40 INFO unit.extra-nrpe-example/0.juju-log nrpe-external-master:3: Invoking reactive handler: reactive/extra_nrpe_example.py:44:missing_nrpe
unit-extra-nrpe-example-0: 07:40:40 INFO unit.extra-nrpe-example/0.juju-log nrpe-external-master:3: Invoking reactive handler: reactive/extra_nrpe_example.py:58:lost_nrpe
unit-extra-nrpe-example-0: 07:40:40 INFO unit.extra-nrpe-example/0.juju-log nrpe-external-master:3: Setting charm primary status False
unit-extra-nrpe-example-0: 07:40:41 DEBUG unit.extra-nrpe-example/0.juju-log nrpe-external-master:3: tracer>
tracer: cleared flag check.configured
tracer: -- dequeue handler reactive/extra_nrpe_example.py:58:lost_nrpe
unit-extra-nrpe-example-0: 07:40:43 INFO unit.extra-nrpe-example/0.juju-log nrpe-external-master:3: Reactive main running for hook nrpe-external-master-relation-broken
unit-extra-nrpe-example-0: 07:40:43 ERROR unit.extra-nrpe-example/0.juju-log nrpe-external-master:3: Unable to find implementation for relation: requires of juju-info
unit-extra-nrpe-example-0: 07:40:43 DEBUG unit.extra-nrpe-example/0.juju-log nrpe-external-master:3: tracer>
tracer: starting handler dispatch, 5 flags set
tracer: set flag check.version
tracer: set flag config.default.nagios_context
tracer: set flag config.default.nagios_servicegroups
tracer: set flag config.default.remove-check
tracer: set flag config.set.nagios_context
unit-extra-nrpe-example-0: 07:40:43 DEBUG unit.extra-nrpe-example/0.juju-log nrpe-external-master:3: tracer>
tracer: hooks phase, 1 handlers queued
tracer: ++   queue handler hooks/relations/nrpe-external-master/provides.py:18:broken_nrpe
unit-extra-nrpe-example-0: 07:40:44 INFO unit.extra-nrpe-example/0.juju-log nrpe-external-master:3: Invoking reactive handler: hooks/relations/nrpe-external-master/provides.py:18:broken_nrpe
unit-extra-nrpe-example-0: 07:40:44 DEBUG unit.extra-nrpe-example/0.juju-log nrpe-external-master:3: tracer>
tracer: main dispatch loop, 1 handlers queued
tracer: ++   queue handler reactive/extra_nrpe_example.py:44:missing_nrpe
unit-extra-nrpe-example-0: 07:40:44 INFO unit.extra-nrpe-example/0.juju-log nrpe-external-master:3: Invoking reactive handler: reactive/extra_nrpe_example.py:44:missing_nrpe

I think your only opportunity to tear things down in the subordinate is in the -departed hook, and by that time it is too late to inform the other end (the primary). Instead, the primary will need to remember any details it needs, so that when the -broken hook runs there it can do any necessary cleanup without waiting on further information from the subordinate (which will never arrive). I see no support for this in interface:nrpe-external-master (per https://github.com/cmars/nrpe-external-master-interface/issues/2), and using raw charm-helpers you need to do that wiring yourself. I don’t think I’ve actually seen a charm using remove_check.
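A rough sketch of the "remember it on the other end" idea, in plain Python so it runs without Juju. Everything here is hypothetical illustration: the CheckRegistry class, the JSON file and the "extra_check" shortname are not part of any existing charm or interface. In a real charm the same bookkeeping could probably live in the unit's key/value store (charmhelpers.core.unitdata.kv()) instead of a JSON file.

```python
import json
import os
import tempfile


class CheckRegistry:
    """Persist which check shortnames each relation contributed."""

    def __init__(self, path):
        self.path = path

    def _load(self):
        if not os.path.exists(self.path):
            return {}
        with open(self.path) as fh:
            return json.load(fh)

    def _save(self, data):
        with open(self.path, "w") as fh:
            json.dump(data, fh, sort_keys=True)

    def remember(self, relation_id, shortnames):
        """Call while the relation is still alive (e.g. on -changed)."""
        data = self._load()
        data[relation_id] = sorted(set(shortnames))
        self._save(data)

    def forget(self, relation_id):
        """Call on -broken; returns the checks that need cleaning up."""
        data = self._load()
        names = data.pop(relation_id, [])
        self._save(data)
        return names


def demo():
    """Exercise the registry against a throwaway file."""
    path = os.path.join(tempfile.mkdtemp(), "checks.json")
    registry = CheckRegistry(path)
    registry.remember("nrpe-external-master:3", ["extra_check"])
    first = registry.forget("nrpe-external-master:3")
    second = registry.forget("nrpe-external-master:3")
    return first, second
```

The point of the design is that the -broken handler only needs the relation id, which it still has, and never has to read data from the departed subordinate.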

This seems to be another facet of Bug #1417874 “[RFE] Impossible to cleanly remove a unit from a r...” : Bugs : juju-core

Guess it won’t be so straightforward then. I should’ve probably taken the hint when I scoured GitHub for examples of other charms using remove_check and didn’t find much :)

Thank you @stub for the background info!