Juju loses communication with agents

We have a cluster with 3 Juju controllers on VMs running on different hardware nodes. Each VM has plenty of RAM (24GB) and only runs the Juju controller. The controllers serve 5-10 models with many machines and units (~300 machines, ~2000 units) as well as relations (OpenStack deployments).

We have noticed great instability over the last 3 months (during which our models grew to this extent). The main issues are the following:

  1. Jujud on the controller stops responding, although it seems to be running. Restarting the jujud-machine-X service resolves the issue.

  2. (This is the most frequent and critical.) Juju agents lose connectivity with the controllers. A juju status shows the majority of machines and units as “down”, and the agents log several errors like:

    ERROR juju.worker.dependency engine.go:663 "logging-config-updater" manifold worker returned unexpected error: model "d7854e58-efca-4aba-845a-2dca56980f92" not found (not found)
    ERROR juju.worker.logger logger.go:85 model "d7854e58-efca-4aba-845a-2dca56980f92" not found (not found)

All indications we have are that there are no performance issues, but we lack the expertise in Mongo to be confident about its performance.

Any suggestions are greatly appreciated!


@soumplis thanks for the post here. It seems like you’ve got a really good sized set of infrastructure there.

We’ve found issues that seem a lot like this and addressed a number of bugs in Juju controllers that helped our experience. It’d be interesting to know what version of Juju your controllers are on.

In going back and checking the release notes I think a big release target would be

The version is the latest stable (2.6.8) and the underlying OS (Ubuntu Bionic) is also up to date with the latest packages upstream.

ubuntu@cfp001:~$ juju show-controller ctl-maas
ctl-maas:
  details:
    uuid: 352cbcb5-b601-4768-88f6-59f44546ca88
    controller-uuid: 352cbcb5-b601-4768-88f6-59f44546ca88
    api-endpoints: ['10.0.255.2:17070', '10.0.255.3:17070', '10.0.255.4:17070']
    cloud: maas_cloud
    agent-version: 2.6.8
    mongo-version: 3.6.3
    ca-fingerprint: (...)
(...)

It would be good to know what version of Juju you are running. We do have our production controller that runs stably at approximately the same size (1100 total machines, 260 models, 6000 total units). When you say 5-10 models, 300 machines, is that 10 * 300 machines = 3000 machines, or is it 300 machines distributed across 10 models?

We did do several fixes in the 2.5 series that helped with stability for our production controller. (It has been running 2.5.1 stably since it came out, though we are looking to upgrade to 2.6.8 in the next week.)

Random disconnects sound like they could be an MTU mismatch.
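If you want to rule that out quickly, something like the following checks can help (a rough sketch; 10.0.255.2 is one of the api-endpoints shown above, and 1472 = 1500 minus the 28 bytes of IP/ICMP headers):

    # compare the interface MTUs on an agent machine and on a controller
    ip link show | grep mtu
    # probe the path MTU towards a controller; "Frag needed" errors here
    # point at an MTU mismatch along the path
    ping -c 3 -M do -s 1472 10.0.255.2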

“Model not found” seems like a much more significant issue, and that seems a bit version dependent (is there a known bug in that version of Juju that might produce that error?).

Another recommended practice if you are going to run juju controllers at scale is to set up Prometheus and Grafana:

That can often give very interesting insights into running operations, and what is going on. (What API calls are taking a long time, what database operations are being a problem, etc.)
I know we’re a bit busy at a conference this week, but likely we’d be interested to see the production system and see if there is anything we can help with.

Doing the maths, in total we have 300 machines, 7020 cores, 2076 units in 6 models (actually there is one model with our production openstack setup which consumes almost 90% of the infrastructure). I will double check the MTUs.

Our version is the latest, 2.6.8, and yesterday we set up the controller monitoring as per the instructions. Is there any possibility that you could provide a more detailed Grafana dashboard? It would save us some time.

Thank you in advance !


This is the Grafana dashboard that we use in production. However, it will need tweaking to match the installation instructions that are mentioned in the other Discourse post (the names of data feeds, controllers, etc.).

It also includes some other data feeds (I believe we have a Mongo source and some other bits).

If you do work through that as a reference and update it for the “Monitoring Juju Controllers” post, we would be very happy to have a version included in that thread.

I think I know the root cause here. The internal model cache in the controller seems not to be up to date. The logging-config-updater gets its information from that cache.

First thing to do is to determine which workers are having issues in the controller. We have some tools on the Juju-created machines to help look inside running systems:

The command that I’d like you to run is:

juju run -m controller --machine 0 juju_engine_report

This will get information about the internal workers running in that controller. Run this command on each of the machines in the controller model.

This should give us enough information initially to determine where things are.
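For a controller HA setup with three machines, a small loop saves each report to its own file for comparison (a minimal sketch; the machine numbers 0-2 assume a standard 3-machine controller model):

    # collect the engine report from each controller machine into its own file
    for m in 0 1 2; do
        juju run -m controller --machine $m juju_engine_report > engine-report-machine-$m.txt
    done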

There is a known bug in 2.6.8 where unit agents may get stuck restarting, if there were issues in particular workers. We have this fixed in the upcoming 2.6.9 release.

I have adjusted the dashboard to fit our needs. The updated dashboard is here.

For reference, some things to configure in order for this to work:

  • Add custom labels to Prometheus for each controller. Our config for each controller is similar to:

    - targets:
        - '10.0.255.2:17070'
      labels:
        host: juju-1
        juju_controller: controller/0
        juju_unit: ubuntu/0
        juju_model: controller
        group: 'juju-introspect'
    
  • In order for the prometheus-mongodb-exporter to work on the Juju mongos you have to retrieve the correct SSL CA cert and SSL cert to connect to the mongo instance:

sed -n '/BEGIN/,/END/p; /END/q' /var/lib/juju/agents/machine-?/agent.conf | sed -e 's/^[[:space:]]*//' > /etc/default/prometheus-mongodb-exporter.juju-ca

cat /var/lib/juju/server.pem > /etc/default/prometheus-mongodb-exporter.juju-crt
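A quick way to sanity-check the extracted files against the controller's MongoDB (a sketch; it assumes openssl is installed and that Mongo listens on Juju's usual local port 37017):

    # a successful handshake should end with "Verify return code: 0 (ok)"
    openssl s_client -connect 127.0.0.1:37017 \
        -CAfile /etc/default/prometheus-mongodb-exporter.juju-ca \
        -cert /etc/default/prometheus-mongodb-exporter.juju-crt \
        -key /etc/default/prometheus-mongodb-exporter.juju-crt \
        </dev/null 2>/dev/null | grep 'Verify return code'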


Here are the results for machine-0, machine-1, machine-2.


I took a look at the engine reports. Nothing really leaps out.

I’m very curious about what the controller logs say around the times that the unit agents get disconnected. Clearly there is something weird going on there.

Just to say we have other controllers that are running hundreds of models with thousands of units and we don’t see this type of instability. I’d really like to get to the bottom of it.

What is the spec of the machines you are using as controllers? Are you open to having a more direct email exchange or perhaps a hangout?

The machines running the controllers are VMs with 8 vCPUs and 24GB RAM, all with access to typical SATA local disks (on the physical host). Following the recommendation of @jameinel I have deployed the Grafana dashboards to get insight into the controllers, and some of the first things to notice are the following:

  • CPU usage is constantly between 30-40% with small spikes up to 80-100%
  • Disk I/O wait is below 100ms, but there is constant I/O of around 3MB/s
  • API calls are around 400 with spikes to 600, and the most frequent API call is ProxyUpdater.ProxyConfig
  • There is a constant flow of 1500-2000 log entries written

What is important to note is that, since this post, there have been no similar errors, and I suspect that the problem was due to excessive I/O requests caused by Juju logging. I changed the log level from unit DEBUG to WARNING with:

juju model-config -m openstack logging-config="<root>=WARNING;unit=INFO"

This seems to have greatly reduced the amount of logs produced; however, I did not have the appropriate metrics in place to provide actual numbers.
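For reference, the current setting can be recorded before the change and checked afterwards by running the same command with no value (a minimal sketch using the model name from above):

    juju model-config -m openstack logging-config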

For sure I can provide any further details or debugging stuff; I will send you contact details in private.


That is not expected behavior. It sounds like something is wrong (units getting restarted, some sort of errors/warnings should be in the logs.)

Leadership.ClaimLeadership should be the #1 call if things are in a stable state (it is called once every 30s for each application in your system to maintain leadership).

The proxy worker is woken up in two situations:

  • the model config is updated with new proxy values
  • the API host ports change

Either that, or the worker is failing for some reason and being restarted.
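A quick way to see whether the first trigger could even apply is to check which proxy-related keys are actually set on the busy model (a sketch; "openstack" is the model name used earlier in this thread):

    juju model-config -m openstack | grep -i proxy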

With 2000 units, I’d expect about a third of the load that we see on our controller.

We have a fairly constant 350 API calls per second across our three controllers.

In an idle controller, most of the calls end up coming from the unit update-status hook calls, which happen every 5 minutes on average. Our controllers have the top call being ClaimLeadership at around 63 calls per second. That is twice the frequency of the next most frequent calls, which are on the Uniter facade.

There is one ClaimLeadership call per application per 30 seconds.
The Uniter frequency relates to every unit, with calls every 300 seconds.

If you have 2000 units, there should be approximately 10 calls per second on the Uniter methods, and depending on your ratio of units to applications, slightly more ClaimLeadership calls.
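As a rough worked example of those rates, under the assumptions above (2000 units, update-status every 300 seconds, one ClaimLeadership call per application every 30 seconds):

    Uniter calls/s          ≈ units / 300s         = 2000 / 300 ≈ 7
                              (closer to ~10/s once the handful of calls per hook is counted)
    ClaimLeadership calls/s ≈ applications / 30s
                              (so the exact rate depends on your units-per-application ratio)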

The fact that you have ProxyUpdater.ProxyConfig as the most called is a signal of an issue. Now to work out what it is.

Another interesting finding is that today we noticed that agents did not communicate(?) with the controllers, without any indication (error logs, process failures, juju status, etc.). We restarted the jujud agents for all units, and since that time the Raft lease operations have gone from below 1 second to a constant 3-5 seconds, and the network traffic on one of the controllers is constantly over 10Mbps (investigating that at the moment).

You have a number of weird things in the engine report and Grafana that make me go 🤔

  • The state pool report on just one of the controllers lists an empty model UUID
  • The high rate of ProxyUpdater.ProxyConfig
  • The high time for the raft updates
  • The network load on one of the controllers

I’m curious as to your findings around the controller with the 10Mbps traffic. What sort of network cards are in those machines? Has it maxed out the throughput?
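While you are digging into that, the link speed and per-interface counters are easy to check on the controller itself (a sketch; eno1 is just an example interface name):

    ip -s link show                  # cumulative RX/TX byte and packet counters per interface
    ethtool eno1 | grep -i speed     # reported link speed; replace eno1 with the real interface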

Thinking back to the initial problem, some of those lost and disconnected agents could well be caused by Mongo re-electing a leader after heartbeat timeouts. There are a number of bugs that we have been squashing that stop agents from terminating properly, one of them in the uniter code, which causes a unit agent to appear lost because it never restarts properly.

Let’s try and grab some more diagnostic information:

  • Are all the models also running 2.6.8? What versions of Juju do we have in play?

    • what is the controller running?
    • what are the models running?
    • what is the CLI client running?
  • Are you using any other client code to talk to the controller? Old Python scripts? I’m trying to work out where the empty model UUID came from.

  • The proxy updater is driven by some low level watchers. The only true way to see what is going on there is to bump the logging and capture some.

    • Run juju model-config -m controller logging-config to get a record of the current logging config.
    • Run juju model-config -m controller logging-config="juju=info;juju.state.txn=trace" to get Juju to write out every write transaction to the log. Capture about five minutes of logs, then put the logging config back to what it was before (see the sketch after this list).
    • Looking at those logs should tell us why the ProxyUpdater is getting triggered so much.
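Putting those steps together, the capture-and-restore sequence looks roughly like this (a sketch; the five-minute sleep is just the suggested capture window):

    # record the current value so it can be restored afterwards
    OLD_CONFIG=$(juju model-config -m controller logging-config)

    # turn on write-transaction tracing for the controller model
    juju model-config -m controller logging-config="juju=info;juju.state.txn=trace"

    # capture about five minutes of logs, then restore the previous setting
    sleep 300
    juju model-config -m controller logging-config="$OLD_CONFIG"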

Just to let you know, I have worked out what is going on here, and will have a fix up shortly.

Sorry for the late reply, I have been disrupted by other emergencies. The excessive network traffic was from the Prometheus container running on one of the controllers, which for some reason generates far too many broadcasts. However, it never maxed out the bandwidth, topping out at 15Mbps on a 1G link to the physical host.

So, regarding your questions:

  • Controller is at 2.6.8
  • All models with at least one unit are running 2.6.8. Only two (of six) models are on 2.6.4, but those are empty models.
  • The CLI client is 2.6.9-bionic-amd64
  • There is no interaction with Juju apart from the Juju CLI and the Juju GUI
  • As for the logs, they do not seem to contain any string related to ProxyUpdater. Instead, when I set unit=DEBUG I do get some logs; should I upload those somewhere? Sorry if I missed something here!

As a heads up @thumper did some awesome chasing down of the problem and has a fix here:

https://github.com/juju/juju/pull/10677

This is going into the 2.6.10 release, which is in QA testing at the moment, and we’ll work to get it out next week. It appears that, due to a lack of sorting, Juju would think it needed to update the proxy data too often. This should correct that and bring some better stability to things.
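To illustrate the failure mode in shell terms (an analogy only, not Juju's actual code): the same set of API addresses in a different order compares as "changed" unless both sides are sorted first, and that spurious change is what kept waking the proxy workers.

    # bash; the addresses below are the api-endpoints from earlier in the thread
    a='10.0.255.2:17070 10.0.255.3:17070 10.0.255.4:17070'
    b='10.0.255.3:17070 10.0.255.2:17070 10.0.255.4:17070'

    # naive comparison: reports a change even though the set is identical
    [ "$a" = "$b" ] && echo unchanged || echo changed

    # order-independent comparison: no change reported
    [ "$(tr ' ' '\n' <<< "$a" | sort)" = "$(tr ' ' '\n' <<< "$b" | sort)" ] \
        && echo unchanged || echo changed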
