We have a cluster of 3 Juju controllers on VMs running on different hardware nodes. Each VM has plenty of RAM (24GB) and runs nothing but the Juju controller. The controllers serve 5-10 models with many machines, units, and relations (~300 machines, ~2000 units; OpenStack deployments).
We have noticed significant instability over the last 3 months (the period during which our models grew to this size). The main symptoms are:
Jujud on the controller stops responding, even though the process appears to be running. Restarting the jujud-machine-X service resolves the issue.
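For reference, the restart we perform looks roughly like this. The controller agent runs as a systemd unit named after the machine; the machine ID `0` below is a placeholder, substitute your actual controller machine ID:

```shell
# On the affected controller VM ("0" is a placeholder machine ID).
sudo systemctl restart jujud-machine-0

# Confirm the agent came back up and watch its log for errors.
sudo systemctl status jujud-machine-0
sudo tail -f /var/log/juju/machine-0.log
```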
(Most frequent and critical) Juju agents lose connectivity with the controllers. `juju status` shows the majority of machines and units as "down", and the agents repeatedly log errors such as:
```
ERROR juju.worker.dependency engine.go:663 "logging-config-updater" manifold worker returned unexpected error: model "d7854e58-efca-4aba-845a-2dca56980f92" not found (not found)
ERROR juju.worker.logger logger.go:85 model "d7854e58-efca-4aba-845a-2dca56980f92" not found (not found)
```
All indications are that there are no performance issues, but we lack the MongoDB expertise to be confident about its performance.
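In case it helps to reproduce our checks: this is the kind of MongoDB diagnostic session we have been attempting. It is a sketch only; the port (37017) and the location of the credentials are our assumptions about a Juju-managed mongod and may differ on other deployments:

```shell
# Inside a controller VM. Juju's mongod is assumed to listen on 37017 with
# TLS and auth; the password is assumed to be in the agent config under
# /var/lib/juju/agents/ (verify both on your own deployment).
mongo --ssl --sslAllowInvalidCertificates \
      -u admin -p "$MONGO_PASSWORD" localhost:37017/admin <<'EOF'
// Replica-set health: which member is PRIMARY, is anything DOWN or lagging?
rs.status().members.forEach(function (m) { print(m.name, m.stateStr); });

// Connection count and memory use on this member.
printjson(db.serverStatus().connections);
printjson(db.serverStatus().mem);

// Operations currently in flight (look for long-running ones).
printjson(db.currentOp());
EOF
```

`rs.status()`, `db.serverStatus()`, and `db.currentOp()` are standard MongoDB shell helpers; interpreting their output (e.g. replication lag between members, connection spikes) is where we would appreciate guidance.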
Any suggestions are greatly appreciated!