As we set up our important infrastructure, we set up monitoring and alerting so that we can keep a close eye on it. As the services we provide grow and need to expand, or failures in hardware attempt to wreak havoc, we’re ready because of the due diligence that’s gone into monitoring the infrastructure and applications deployed. One thing I often see is that folks forget to monitor the tools that coordinate all of their deployment and configuration management. It’s a bit of a case of “who watches the watcher?”.
While building and operating JAAS (Juju as a Service) we’ve had to make sure that we follow the same best practices for our Juju hosting infrastructure that you’d use on the production systems you run in JAAS. This means that our Juju Controllers (the state and event tracking back end) need to be watched, so that we can prevent the issues we see coming and learn about the ones we cannot as soon as possible.
Fortunately, there are some great established tools for doing this. We use Prometheus, Grafana, and Telegraf to help us keep our JAAS infrastructure running smoothly. What’s better is that there are Charms available for each that enable us to quickly and easily deploy, configure, and operate the monitoring setup. This means you can replicate the operational knowledge we’ve gained by reusing the exact same Charms we use in production.
NOTE: for this setup we’re going to use extra hardware to monitor our production Juju Controllers. This approach is best practice to ensure that we’re not going to be adding any load to what we’re measuring with the measuring itself. It also means that when our Controllers are HA we can wire them each up to a single set of monitoring endpoints. You can also replicate this setup, but done with greater density, by placing these applications on the Controller nodes themselves if you’re hardware constrained or have softer production requirements.
Getting our Controller set up
Let’s walk through an example setup. First, we’ll create a new controller on GCE where we’re going to run our production applications. Note that this will work for any Controller on any cloud you choose to run Juju on.
$ juju bootstrap google production
...
Bootstrap complete, "production" controller now available.
Controller machines are in the "controller" model.
Initial model "default" added.
One interesting thing in the output at the end is that your “Controller machines” are in the controller model. We want to watch those so let’s switch to that model and work from there.
$ juju switch controller
production:admin/default -> production:admin/controller
The first thing we need is to get our applications for our monitoring stack into the model. Let’s deploy them.
$ juju deploy cs:prometheus2 prometheus
$ juju deploy cs:telegraf
Telegraf needs to be told to work together with Prometheus and send metrics to it. Let’s wire that up.
$ juju relate telegraf:prometheus-client prometheus:target
Telegraf is meant to watch the system metrics, such as RAM, disk, and CPU. It then allows Prometheus to scrape that data and track it over time. We need to get Telegraf onto each of the controller machines that we want to watch. Telegraf is what we call a subordinate charm. It’s intended to sit alongside and watch existing things that are running. We’ll “fake” a running charm on our controllers by deploying the Ubuntu-lite charm to them using the “--to 0” option. Once Ubuntu is deployed we can relate it to Telegraf, which will set up the data output, and then we’ll sanity check that things are wired up properly.
$ juju deploy --to 0 cs:~jameinel/ubuntu-lite ubuntu
$ juju relate telegraf:juju-info ubuntu:juju-info
We can check to see if Telegraf is ready to go and monitoring our Controller machine by visiting the URL it exposes the data on. Note that since we’re checking that from our browser we’ll need to temporarily expose it and allow firewall access to the URL.
$ juju expose telegraf
$ chrome http://18.104.22.168:9103/metrics
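If you’d rather script this sanity check than eyeball the page, here is a minimal Python sketch. It assumes the endpoint serves the standard Prometheus text exposition format. The metric names and the host label in the sample are illustrative guesses at typical Telegraf output, not copied from a real controller, so expect your list to differ.

```python
import urllib.request


def metric_names(exposition_text):
    """Return the sorted, de-duplicated metric names in Prometheus text output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and the "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # A sample looks like `name{label="x"} 1.0` or `name 1.0`.
        names.add(line.split("{", 1)[0].split(" ", 1)[0])
    return sorted(names)


def fetch_metrics(url, timeout=5):
    """Download the raw /metrics page as text (needs network access to the unit)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Hypothetical sample of what Telegraf might expose; real names may differ.
SAMPLE = """\
# HELP cpu_usage_idle Telegraf collected metric
# TYPE cpu_usage_idle gauge
cpu_usage_idle{cpu="cpu-total",host="controller-0"} 97.5
# HELP mem_used_percent Telegraf collected metric
# TYPE mem_used_percent gauge
mem_used_percent{host="controller-0"} 41.2
"""

if __name__ == "__main__":
    print(metric_names(SAMPLE))
```

Against a live unit you would call `metric_names(fetch_metrics(url))` with the same metrics URL you opened in the browser. An empty list is a hint that Telegraf isn’t collecting yet.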
That looks good, now let’s secure that firewall port back up. We won’t need access to the data from the public internet for this to work.
$ juju unexpose telegraf
Setting up Prometheus
With Telegraf now reporting to Prometheus, we can check that Prometheus is getting the data. Again, we’ll expose it temporarily so we can view it from our local browser.
$ juju expose prometheus
$ chrome http://22.214.171.124:9090/
From here we could start to build graphs of our data, but let’s go to the “Status→Targets” menu and make sure we see our Telegraf watching our Controller.
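The same targets check can be done against Prometheus’s HTTP API, which serves the target list as JSON at `/api/v1/targets`. The sketch below picks out any active target that isn’t reporting as “up”. The job names and addresses in the sample response are made up for illustration, and the job names the charm actually generates may look different.

```python
import json
import urllib.request


def unhealthy_targets(targets_response):
    """Return (job, scrape_url, health) for every active target that is not 'up'."""
    active = targets_response.get("data", {}).get("activeTargets", [])
    return [
        (
            t.get("labels", {}).get("job", "?"),
            t.get("scrapeUrl", "?"),
            t.get("health", "unknown"),
        )
        for t in active
        if t.get("health") != "up"
    ]


def fetch_targets(prometheus_url, timeout=5):
    """Fetch the target list from a reachable Prometheus (needs network access)."""
    url = prometheus_url.rstrip("/") + "/api/v1/targets"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


# Hypothetical response trimmed down to the fields used above.
SAMPLE_RESPONSE = {
    "status": "success",
    "data": {
        "activeTargets": [
            {
                "labels": {"job": "telegraf"},
                "scrapeUrl": "http://10.0.0.5:9103/metrics",
                "health": "up",
            },
            {
                "labels": {"job": "juju"},
                "scrapeUrl": "https://10.0.0.5:17070/introspection/metrics",
                "health": "down",
            },
        ],
        "droppedTargets": [],
    },
}

if __name__ == "__main__":
    for job, url, health in unhealthy_targets(SAMPLE_RESPONSE):
        print(f"{job}: {url} is {health}")
```

An empty result from `unhealthy_targets(fetch_targets(...))` means every target Prometheus knows about is being scraped successfully.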
Like before, let’s close the firewall back up to leave things secure.
$ juju unexpose prometheus
Adding Juju-specific metrics
Telegraf is great for outputting metrics of the system load and the like. However, we have also baked metrics into Juju itself that we can enable. To do so, we’ll need to set up a Juju user account that Prometheus can use to pull metrics from our Juju Controller.
# Note that you don't need to register with this 'bot' user
$ juju add-user prometheus
$ juju change-user-password prometheus
$ juju grant prometheus read controller
For the moment, you need to pass the config to Prometheus that adds a new target for the Juju metrics. I’ve got a sample config file here that you can send with the following config call. Note that you need to add the IP address of the Controller to the configuration. I’ve configured this file with my controller IP from this demo. I’ve also set the password to match what I used in the above commands to set up the prometheus bot user.
- job_name: juju
  metrics_path: /introspection/metrics
  scheme: https
  static_configs:
    - targets: ['126.96.36.199:17070']
  basic_auth:
    username: user-prometheus
    password: monitor_all_the_things
  tls_config:
    insecure_skip_verify: true
$ juju config prometheus email@example.com
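Since the controller IP and the bot password have to be edited into that file by hand, a tiny generator can save a typo. This is only a sketch that mirrors the sample job above. The function name and its defaults are my own invention, and it relies on JSON-style double-quoted strings also being valid YAML scalars, which holds for ordinary ASCII values.

```python
import json


def render_juju_scrape_job(controller_ip, password,
                           username="user-prometheus", port=17070):
    """Render the Juju scrape job from the sample above as YAML text."""
    # json.dumps produces a double-quoted string that YAML also accepts.
    target = json.dumps(f"{controller_ip}:{port}")
    lines = [
        "- job_name: juju",
        "  metrics_path: /introspection/metrics",
        "  scheme: https",
        "  static_configs:",
        f"    - targets: [{target}]",
        "  basic_auth:",
        f"    username: {json.dumps(username)}",
        f"    password: {json.dumps(password)}",
        "  tls_config:",
        # Kept from the sample: skips TLS verification of the controller cert.
        "    insecure_skip_verify: true",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # 10.0.0.5 is a placeholder; use your own controller address and password.
    print(render_juju_scrape_job("10.0.0.5", "monitor_all_the_things"), end="")
```

Write the output to a file and hand that file to the config call above.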
Visualizing that data
With our data flowing in it’s time to watch it. Let’s set up Grafana to talk to our Prometheus where all the data is sitting.
$ juju deploy cs:~prometheus-charmers/grafana
$ juju config grafana admin_password=monitor_all_the_things
$ juju add-relation prometheus:grafana-source grafana:grafana-source
$ juju expose grafana
$ juju status grafana
We need to configure Grafana through the web UI from here, so the last line above prints the IP address and port where it is available. We also make sure to expose Grafana on the public internet so that we can check it out any time we want. We log in with the username admin and the password that we used in the config above.
From here, we’ll need to add a dashboard. I’ve got a demo dashboard you can load in to get started with. You can get it from here and import it using the menu item in Grafana for “Dashboards → Import”. You’ll need to tell it to wire up to our juju-controller data source we set up above and you should get something that looks like this:
From here we can now see how many models are running on our production infrastructure and how many machines those models are taking up. We then start to get the Telegraf data for the controller itself: how it’s doing on memory, CPU, load, and disk space.
Where to go from here
From here we’d obviously look to add more controllers to get to an HA status.
We would set up alerting because what good is measuring if we don’t get a giant hint that things are heading out of whack?
We might break the Prometheus and Telegraf charms into their own model and use cross model relations to wire up the data flow we need. In this way we’d be able to reuse those services for other measuring we want to do within the infrastructure.
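To give a feel for the alerting idea, here is a rough Python sketch of a threshold check built on Prometheus’s instant query API (`/api/v1/query`). A real deployment would more likely use Prometheus alerting rules and Alertmanager, so treat this as an illustration of the logic only. The `mem_used_percent` metric name, the `host` label, and the addresses are assumptions about what Telegraf reports.

```python
import urllib.parse


def build_query_url(prometheus_url, promql):
    """Build the instant-query URL for a PromQL expression."""
    query = urllib.parse.urlencode({"query": promql})
    return prometheus_url.rstrip("/") + "/api/v1/query?" + query


def over_threshold(query_response, threshold):
    """Map host -> value for every sample in a vector result above threshold."""
    breaches = {}
    for sample in query_response.get("data", {}).get("result", []):
        labels = sample.get("metric", {})
        host = labels.get("host") or labels.get("instance", "unknown")
        # Each sample's value is a [timestamp, "value-as-string"] pair.
        value = float(sample["value"][1])
        if value > threshold:
            breaches[host] = value
    return breaches


# Hypothetical response for an HA setup with two controllers.
SAMPLE_QUERY_RESPONSE = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"host": "controller-0"}, "value": [1500000000.0, "41.2"]},
            {"metric": {"host": "controller-1"}, "value": [1500000000.0, "93.4"]},
        ],
    },
}

if __name__ == "__main__":
    print(build_query_url("http://10.0.0.9:9090", "mem_used_percent"))
    print(over_threshold(SAMPLE_QUERY_RESPONSE, 90.0))
```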
What’s good to know is that the teams running JAAS are doing all of this for you. They’ve put together the practices to make sure that you can rely on the services provided and they’re ahead of any potential problems.
What are we missing? Comment below with any issues you find following this or what other things you’d like to see in here!