Juju's hierarchy of needs

thumper · 2 December 2019 20:52

Much like the human needs Maslow described, Juju has a hierarchy of needs. If the bottom layers aren’t healthy, the layers above are going to have issues.

Here we are assuming HA controllers, as that is the recommended production deployment structure. The base level is the controllers themselves, and the controllers have needs of their own:

Can the physical machines route to each other over the network? If the machines can’t talk to each other, there are problems.
Are the physical disks operating properly? You might think this is obvious, but we had a past issue where one of the VMs was under such heavy I/O load that it was effectively timed out for 1.5 hours out of every 2 hours.
Is mongo running and healthy? Is the replica set operational and free of lag? Check /var/log/syslog for mongo output. Are there massive timeouts for queries? What does mongotop show for mongo load?
Are the controllers operational? Are the workers running? Running juju_engine_report on the controller machines will show this. juju_presence_report will show that machine’s view of connectivity, and juju_pubsub_report will show whether all the machines are properly connected to each other for forwarding messages, which is now essential for leadership.
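
The first check in that list, routability between the controllers, can be sketched as a plain TCP connect test. This is a minimal sketch and not a Juju tool: the function names are made up for illustration, and the ports (17070 for the Juju API, 37017 for mongo) are the usual defaults, so treat them as assumptions and substitute whatever your deployment uses.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False


def unreachable_peers(peers, ports=(17070, 37017), timeout=3.0):
    """List the (host, port) pairs that could not be reached from this machine.

    peers: addresses of the other controller machines (supplied by you).
    ports: assumed defaults, 17070 for the Juju API and 37017 for mongo.
    """
    return [
        (host, port)
        for host in peers
        for port in ports
        if not can_connect(host, port, timeout)
    ]
```

Run it on each controller with the addresses of the others; an empty list means this layer looks fine from that machine’s side. A successful connect only shows that the port is reachable, not that mongo or the controller behind it is healthy.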

If the controllers are good and mongo is healthy, then you can move onto the deployed machines.

The deployed machines need to be able to route to the controllers. If they can’t, then they will not be able to deploy: the GET for the agents will fail during cloud-init.
Have the agents been set up properly? Assuming the machine agent has started, it deploys the units. Are the services running? Are the symlinks set correctly? Every agent is able to give an engine report using “juju_engine_report [unit-name]”, where unit-name is the stringified unit tag (e.g. mysql/1 becomes unit-mysql-1). If you call juju_engine_report without a unit name, you get the engine report from the machine agent.
If the agents are all operational, and the controller is happy, the last place we generally look for failures is the charm code itself. This is normally more obvious as there are hook execution errors.
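
The unit name to unit tag conversion mentioned above (mysql/1 becomes unit-mysql-1) is easy to mistype by hand, so here is a small sketch of it. The function name `unit_tag` is hypothetical, and the validation pattern is a loose approximation for illustration rather than Juju’s own naming rules.

```python
import re

# Loose pattern for "<application>/<number>"; an assumption for this sketch,
# not Juju's own validation rules.
_UNIT_NAME = re.compile(r"^([a-z][a-z0-9-]*)/(\d+)$")


def unit_tag(unit_name):
    """Convert a unit name such as 'mysql/1' to its tag form 'unit-mysql-1'."""
    match = _UNIT_NAME.match(unit_name)
    if match is None:
        raise ValueError(f"not a unit name: {unit_name!r}")
    application, number = match.groups()
    return f"unit-{application}-{number}"


print(unit_tag("mysql/1"))  # unit-mysql-1
```

The result is the string you would pass to juju_engine_report on the machine hosting that unit.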

Now, often when debugging you start somewhere in the middle of this list. Before you get too stuck trying to work out why one thing isn’t working, first check that the assumptions of the layer you are looking at are valid, that is, that the layers beneath it are healthy.
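
One way to picture that advice is as an ordered walk from the bottom layer up, stopping at the first layer that fails. The sketch below is illustrative only: the layer names paraphrase this post, `first_failing_layer` is a made-up helper, and the checks are placeholders you would replace with real tests.

```python
def first_failing_layer(layers):
    """Run checks from the bottom layer up; return the first failing layer name.

    layers: list of (name, check) pairs ordered bottom to top, where check is
    a zero-argument callable returning True when that layer is healthy.
    Returns None if every layer passes.
    """
    for name, check in layers:
        if not check():
            return name
    return None


# Placeholder checks; each lambda stands in for a real test.
layers = [
    ("controller network", lambda: True),
    ("controller disks", lambda: True),
    ("mongo", lambda: False),  # pretend mongo is unhealthy
    ("controller agents", lambda: True),
    ("machine routing to controllers", lambda: True),
    ("machine and unit agents", lambda: True),
    ("charm code", lambda: True),
]

print(first_failing_layer(layers))  # mongo
```

Stopping at the first failure is deliberate: anything reported by a higher layer is suspect until the layer below it is known to be good.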

Hopefully this will help you diagnose Juju issues.