Monitoring Juju Controllers


As we set up important infrastructure we also set up monitoring and alerting so that we can keep a close eye on it. When the services we provide grow and need to expand, or hardware failures attempt to wreak havoc, we’re ready because of the due diligence that’s gone into monitoring the infrastructure and the applications deployed on it. One thing I often see is that folks forget to monitor the tool that coordinates all of their deployment and configuration management. It’s a bit of a case of “who watches the watcher?”.

While building and operating JAAS (Juju as a Service) we’ve had to make sure that we follow the same best practices for our Juju hosting infrastructure that you’d use on the production systems you run in JAAS. This means that our Juju Controllers (the state and event tracking back end) need to be watched so that we’re ready to prevent any issues we can see coming, and know as soon as possible about the ones we cannot.

Fortunately, there are some great established tools for doing this. We use Prometheus, Grafana, and Telegraf to help us keep our JAAS infrastructure running smoothly. What’s better is that there are Charms available for each, enabling us to quickly and easily deploy, configure, and operate the monitoring setup. This means you can replicate the operational knowledge we’ve gained by reusing the exact same Charms we use in production.

NOTE: for this setup we’re going to use extra hardware to monitor our production Juju Controllers. This is best practice, ensuring that the measuring itself doesn’t add load to what we’re measuring. It also means that when our Controllers are in HA we can wire each of them up to a single set of monitoring endpoints. If you’re hardware constrained or have softer production requirements, you can replicate this setup at greater density by placing these applications on the Controller nodes themselves.

Getting our Controller set up

Let’s walk through an example setup. First, we’ll create a new controller on GCE where we’re going to run our production applications. Note that this will work for any Controller on any cloud you choose to run Juju on.

$ juju bootstrap google production
Bootstrap complete, "production" controller now available.
Controller machines are in the "controller" model.
Initial model "default" added.

One interesting thing in the output at the end is that your “Controller machines” are in the controller model. We want to watch those, so let’s switch to that model and work from there.

$ juju switch controller
production:admin/default -> production:admin/controller

The first thing we need is to get our applications for our monitoring stack into the model. Let’s deploy them.

$ juju deploy cs:prometheus2 prometheus
$ juju deploy cs:telegraf

Telegraf needs to be told to work together with Prometheus and send metrics to it. Let’s wire that up.

$ juju relate telegraf:prometheus-client prometheus:target

Telegraf is meant to watch system metrics such as RAM, disk, and CPU. It then allows Prometheus to scrape that data and track it over time. We need to get Telegraf onto each of the controller machines we want to watch. Telegraf is what we call a subordinate charm: it’s intended to sit alongside and watch existing things that are running. We’ll “fake” a running charm on our controllers by deploying the ubuntu-lite charm to them using the “--to 0” option. Once Ubuntu is deployed we can relate it to Telegraf, which will set up the data output, and then we’ll sanity check that things are wired up properly.

$ juju deploy --to 0 cs:~jameinel/ubuntu-lite ubuntu
$ juju relate telegraf:juju-info ubuntu:juju-info

We can check that Telegraf is ready to go and monitoring our Controller machine by visiting the URL it exposes the data on. Note that since we’re checking from our browser, we’ll need to temporarily expose the application and open firewall access to that URL.

$ juju expose telegraf
$ juju status telegraf

With the unit’s public address from the status output, we can load the metrics page in a browser.
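If you’d rather sanity check from a terminal, the page Telegraf serves uses the Prometheus text exposition format. Here’s a minimal sketch of picking values out of it; the sample payload below is illustrative, not actual Telegraf output:

```python
# Minimal reader for the Prometheus text exposition format that
# Telegraf serves on its metrics endpoint.

def parse_metrics(text):
    """Return {metric_with_labels: value} for non-comment lines."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE comments and blank lines
        name, _, value = line.rpartition(" ")
        metrics[name] = float(value)
    return metrics

# Illustrative sample of what a scrape might return.
sample = """\
# HELP cpu_usage_idle Percent CPU idle
# TYPE cpu_usage_idle gauge
cpu_usage_idle{cpu="cpu-total"} 97.5
mem_available_percent 82.1
"""

parsed = parse_metrics(sample)
print(parsed['cpu_usage_idle{cpu="cpu-total"}'])  # → 97.5
```

In practice you’d fetch the page with curl or urllib from a machine that can reach the unit.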

That looks good. Now let’s close that firewall port back up; we won’t need access to the data from the public internet for this to work.

$ juju unexpose telegraf

Setting up Prometheus

With Telegraf reporting to Prometheus, we can now check that Prometheus is getting the data. Again, we’ll expose it temporarily so we can view it from our local browser.

$ juju expose prometheus
$ juju status prometheus

As before, the status output shows the public address where the Prometheus web UI is available.

From here we could start to build graphs of our data, but let’s go to the “Status→Targets” menu and make sure we see our Telegraf watching our Controller.
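If you want to automate that check, Prometheus also reports target health through its HTTP API at /api/v1/targets. A sketch of inspecting a response, with a trimmed, illustrative payload standing in for the real one:

```python
import json

# Pull out the scrape URLs of any targets that are not healthy,
# from a /api/v1/targets response body.

def unhealthy_targets(api_response):
    """Return scrape URLs of targets not reporting 'up'."""
    data = json.loads(api_response)
    return [t["scrapeUrl"]
            for t in data["data"]["activeTargets"]
            if t["health"] != "up"]

# Illustrative sample response, trimmed to the fields we use.
sample = json.dumps({
    "status": "success",
    "data": {"activeTargets": [
        {"scrapeUrl": "http://10.0.0.7:9103/metrics", "health": "up"},
        {"scrapeUrl": "https://10.0.0.5:17070/introspection/metrics",
         "health": "down"},
    ]},
})

print(unhealthy_targets(sample))  # → the 'down' target only
```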

Like before, let’s close back up the firewall to leave things secure.

$ juju unexpose prometheus

Adding Juju-specific metrics

Telegraf is great for outputting metrics of the system load and the like. However, we have also baked metrics into Juju itself that we can enable. To do so, we’ll need to set up a Juju user account that Prometheus can use to pull metrics from our Juju Controller.

# Note that you don't need to register with this 'bot' user
$ juju add-user prometheus
$ juju change-user-password prometheus
$ juju grant prometheus read controller

For the moment, you need to pass Prometheus the config that adds a new target for the Juju metrics. I’ve got a sample config file here that you can use with the following config command. Note that you need to add the IP address of the Controller to the configuration. I’ve configured this file with my controller IP from this demo, and I’ve also set the password to match what I used in the commands above to set up the prometheus bot user.

Sample Config

- job_name: juju
  metrics_path: /introspection/metrics
  scheme: https
  static_configs:
    - targets: ['<controller-ip>:17070']
  basic_auth:
    username: user-prometheus
    password: monitor_all_the_things
  tls_config:
    insecure_skip_verify: true

$ juju config prometheus scrape-jobs=@controller-prom.yaml

Visualizing that data

With our data flowing in, it’s time to watch it. Let’s set up Grafana to talk to our Prometheus, where all the data is sitting.

$ juju deploy cs:~prometheus-charmers/grafana
$ juju config grafana admin_password=monitor_all_the_things
$ juju add-relation prometheus:grafana-source grafana:grafana-source
$ juju expose grafana
$ juju status grafana 

We need to configure Grafana through its web UI from here, so the last command above shows the IP address and port where it is available. We also make sure to expose Grafana on the public internet so that we can check on it any time we want. We log in with the username admin and the password we used in the config above.


From here, we’ll need to add a dashboard. I’ve got a demo dashboard you can load in to get started. You can get it from here and import it using the “Dashboards → Import” menu item in Grafana. You’ll need to point it at the juju-controller data source we set up above, and you should get something that looks like this:

From here we can now see how many models are running on our production infrastructure and how many machines those models are taking up. We also start to get the Telegraf data for the controller itself: how it’s doing on memory, CPU, load, and disk space.

Where to go from here

From here we’d obviously look to add more controllers to get to an HA status.

We would setup alerting because what good is measuring if we don’t get a giant hint that things are heading out of whack?

We might break the Prometheus and Telegraf charms into their own model and use cross model relations to wire up the data flow we need. In this way we’d be able to reuse those services for other measuring we want to do within the infrastructure.

What’s good to know is that the teams running JAAS are doing all of this for you. They’ve put together the practices to make sure that you can rely on the services provided and they’re ahead of any potential problems.

What are we missing? Comment below with any issues you find following this or what other things you’d like to see in here!

Managing Juju in Production - ToC

Great post. We should extend it now that cross model relations are a thing; prometheus can be deployed to a separate model/controller and monitoring for a bunch of stuff centralised etc.


Thanks, I’ve just ported what I had. I’m working on next steps to update it and bring it in line with more modern Juju. I can look at CMR for the collection tools as part of it as an optional path.


Ok, post update is complete. Updated the charms to prometheus2 and tested to make sure all the settings still worked out. Please let me know if you find anything and reach out and let me know what you’d like to see from here.



Very nice post. As we have set up (Dec 2018) a “Candid” server to connect with a “beta-production” juju controller, we are going to mimic this fully.

We’ll let you in on our progress and findings. @hallback is also present in the work.

Any good advice appreciated.


Awesome, I look forward to seeing how it turns out. My only advice is to definitely pay attention to things like controller load and disk usage, as those are strong indicators of issues you might need to keep an eye on.


Something that caught me out, is that when setting up the prometheus metrics for the juju controller, the user name has to be “user-prometheus” and NOT the user you just created.


It does have to be user-<username>, but did you create a user other than ‘prometheus’ ?


Why do we get the user to namespace it, considering we could do that in the CLI?


The bytes-on-the-wire when logging into a Juju controller are always prefixed (machine-0, user-simon, unit-ubuntu-0). Juju normally does that for you when you do “juju login” because you can only login from the CLI as a “user”. However, Prometheus needs to send the right username to juju, and so must pass “user-simon” as part of the Basic Auth header in the HTTP request.
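To make that concrete, here’s a sketch of the Basic Auth header Prometheus ends up sending; the username and password here are illustrative:

```python
import base64

# Build the HTTP Basic Auth header Juju expects. The key point is
# the "user-" prefix on the username; credentials are illustrative.

def juju_basic_auth(username, password):
    token = base64.b64encode(f"user-{username}:{password}".encode()).decode()
    return f"Basic {token}"

header = juju_basic_auth("simon", "secret")
print(header)
```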


I should probably include my snippet that does a decent job with HA controllers:

- job_name: juju
  metrics_path: /introspection/metrics
  scheme: https
  static_configs:
    - targets:
      - '<IP0>:17070'
      labels:
        host: controller/0
        juju_unit: controller/0
    - targets:
      - '<IP1>:17070'
      labels:
        host: controller/1
        juju_unit: controller/1
    - targets:
      - '<IP2>:17070'
      labels:
        host: controller/2
        juju_unit: controller/2
  basic_auth:
    username: "user-prometheus"
    password: "prometheus"
  tls_config:
    insecure_skip_verify: true

Note that it scrapes all the controllers, and then adds an extra label to each controllers metrics (juju_unit and host).
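Since a block like that is fiddly to keep in sync by hand, here’s a sketch that generates it for any number of controller IPs; the field names follow Prometheus’s scrape_config format, and the IPs and password are placeholders:

```python
# Generate a Juju scrape job covering several HA controllers,
# attaching host/juju_unit labels to each target. IPs and the
# password are placeholders, not real values.

def ha_scrape_job(ips, password, port=17070):
    lines = [
        "- job_name: juju",
        "  metrics_path: /introspection/metrics",
        "  scheme: https",
        "  static_configs:",
    ]
    for i, ip in enumerate(ips):
        lines += [
            f"    - targets: ['{ip}:{port}']",
            "      labels:",
            f"        host: controller/{i}",
            f"        juju_unit: controller/{i}",
        ]
    lines += [
        "  basic_auth:",
        "    username: user-prometheus",
        f"    password: {password}",
        "  tls_config:",
        "    insecure_skip_verify: true",
    ]
    return "\n".join(lines)

config = ha_scrape_job(["10.0.0.5", "10.0.0.6", "10.0.0.7"], "prometheus")
print(config)
```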

I also have a fairly big prometheus config that uses that. We’ve been iterating on the view we use on our Prodstack controllers, and having that available for my own little testing controller has been really useful.

I should also note that if you are using the default “juju deploy grafana” charms, they use packagecloud.io over HTTPS for the grafana packages. And that plays poorly with apt-cacher-ng. You have to allow proxying over HTTPS, and they seem to use a lot of different websites. There was some advice that this might work:

PassThroughPattern: (packagecloud\.io|packagecloud-repositories\.s3\.dualstack\.us-west-1\.amazonaws\.com|packagecloud-prod\.global\.ssl\.fastly\.net|deb\.nodesource\.com|packages\.gitlab\.com|packages-gitlab-com\.s3\.amazonaws\.com|packagecloud-repositories\.s3\.amazonaws\.com):443$

However, I tried to set that in /etc/apt-cacher-ng/acng.conf and then sudo systemctl restart apt-cacher-ng and then do “sudo apt update” on the machine running grafana, and it gives me:

Err:6 stretch Release
  Received HTTP code 403 from proxy after CONNECT
Reading package lists... Done
E: The repository ' stretch Release' does not have a Release file.
N: Updating from such a repository can't be done securely, and is therefore disabled by default.
N: See apt-secure(8) manpage for repository creation and user configuration details.

However, if I switch to:

PassThroughPattern: .*:443$

It works. However, this allows apt-cacher-ng to proxy for any machine on your network to any https website. So don’t use it anywhere that security matters.