What are your tips for running "Juju in production"?

timClicks · 2 February 2020 23:45

Some leading questions perhaps, hopefully this will be the start of a useful discussion though?

How granular are your models? Do you restrict access via
When should you enable HA controllers?
How frequently do you make backups? What do you back up? Have you tested a restore?
Have you enabled any specific firewall rules?
…

thumper · 3 February 2020 00:59

You should enable HA controllers in any production environment. Single controllers are fine for testing and playing, but in a production environment HA controllers are essential.
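For illustration, a minimal sketch of turning on HA for an existing controller. `ensure_ha` is a hypothetical wrapper, not part of Juju; the only Juju command it calls is `juju enable-ha`. The default of three controllers is an assumption for the example.

```shell
# Sketch: enable HA on a controller, refusing even counts up front.
# ensure_ha is a hypothetical helper; only `juju enable-ha` is a Juju command.
ensure_ha() {
    controller="$1"
    count="${2:-3}"   # assumed default: three controllers

    # Juju expects an odd number of controllers so that the database
    # replica set can always form a majority.
    if [ $((count % 2)) -eq 0 ]; then
        echo "controller count must be odd, got $count" >&2
        return 1
    fi

    juju enable-ha -c "$controller" -n "$count"
}
```

Something like `ensure_ha my-controller 3` would request three controller machines; `juju show-controller` can then be used to check on them.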

erik-lonroth · 3 February 2020 19:07

We are just discovering this at Scania atm. We have been up and running for some 3 months, but still need more experience.

We have 3 controller units in HA behind 2x haproxy.

A haproxy charm with Let’s Encrypt support is much needed, since we currently have to set that up manually.

Really, it’s just weird that bundles for this purpose don’t already exist in the charm store.

zicklag · 3 February 2020 20:05

Hey, you might be interested in the Let’s Encrypt proxy charm that I’m designing right now, then. It is something that I have a need for so I started work on it 4 days ago. I’m thinking I should be finished fairly soon. Maybe within a week, but no promises.

There is an existing SSL proxy charm, but I don’t know whether or not it is scalable. I thought I had read something that said it wasn’t scalable, but now I can’t find it.

The charm I’m working on will allow you to scale the proxy to any number of units while making sure that the certs are replicated to each unit and that only the leader is actually generating the certificates, so that you don’t exceed the Let’s Encrypt rate limit.

Then you just point your DNS at the proxy servers and it will generate required certs while routing the traffic to the Juju applications that are connected based on the incoming domain.

erik-lonroth · 3 February 2020 20:21

You should connect with @lasse and @martin-hilton as well.

I’ve used ssl-termination-proxy for nextcloud previously.

jameinel · 4 February 2020 09:17

I’m curious why you run Juju controllers behind HA Proxy. All of the juju clients already know about the multi controller nature, and will automatically fail over to another controller if the one they are currently connected to dies, and while connected they maintain the list of active controllers, so a new one coming up should get added to the list of possible contacts.
The only failure mode that I’m aware of is if you are killing controllers fast enough that you’ve rotated all controllers away in the time that a client has been disconnected (e.g. you have 3 running today, a client disconnects, then by next week you have killed and started 3 new controllers; a client connecting won’t actually know any of the new controllers to discover the rest).
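To make that reasoning concrete, here is a deliberately simplified sketch. It is not how the Juju client is implemented; it only shows the idea of trying each cached controller address in turn. `probe_endpoint` and `pick_controller` are hypothetical names, and the probe is assumed to be a plain TCP connect via `nc`.

```shell
# Sketch only: a sequential stand-in for client-side controller failover.
# probe_endpoint is a hypothetical reachability check (TCP connect via nc).
probe_endpoint() {
    host="${1%:*}"    # text before the last colon
    port="${1##*:}"   # text after the last colon
    nc -z -w 2 "$host" "$port" >/dev/null 2>&1
}

# Print the first cached controller address that answers; fail if none do.
pick_controller() {
    for addr in "$@"; do
        if probe_endpoint "$addr"; then
            echo "$addr"
            return 0
        fi
    done
    echo "no cached controller address is reachable" >&2
    return 1
}
```

Called with three cached addresses on the default API port 17070, this succeeds as long as any one of them is still alive. The failure mode described above is the case where every cached address has been replaced, so the loop runs out of candidates.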

rick_h · 4 February 2020 15:27

@jameinel did some great work on creating a bundle to help with getting monitoring set up on your controllers, as well as a pre-made dashboard ready to go. I’ve updated the post around monitoring controllers with that setup and have a bundle in the store that is a bit of plug-n-play (well, almost) that I encourage folks to check out.

timClicks · 4 February 2020 22:43

@rick_h I had actually forgotten that you’ve already compiled a great list of resources!

soumplis · 8 February 2020 10:01

In order for Juju to be applicable to large production environments, a lot of work on RBAC and authentication is needed. There is a huge lack of commonly used auth backends (e.g. LDAP, OAuth, SAML) or integration with cloud providers’ auth (e.g. MAAS, OpenStack). On top of that, we have very limited RBAC, and it is impossible to have user groups, per-application access, per-command access (e.g. allow a user to change charm config but not run commands on units), etc.

Another required feature would be to allow decomposition, or containerization, of the Juju controller services. Scalability is an issue, and being able to use an external, maybe optimized, MongoDB would help, as would being able to spawn a Juju controller on k8s.

timClicks · 12 February 2020 22:02

Here are some useful threads:

How to put controllers behind HTTPS

Controllers exposed to the Internet should (at a minimum) be backed by TLS.

How to use an external identity provider

Look into Juju’s internals

Production users benefit from an understanding of how Juju gets its work done. Internally, Juju is a network of software agents (jujud processes) in a star topology. The central node is the controller.

To create a report of any given agent, juju ssh into the machine, then run juju_engine_report:

$ juju ssh <machine-id>
$ juju_engine_report

Under Kubernetes, juju ssh is unavailable. Use kubectl exec to access the operator pod (which is where the relevant agent is executing). You will also need to load the introspection scripts into your session with source.

$ kubectl -n <model> exec -ti <application>-operator-<unit-number> bash
$ source /etc/profile.d/juju-introspection.sh
$ juju_engine_report

The juju_engine_report provides valuable diagnostics. A useful periodic task is to run an engine report for each jujud process on each machine. Some tooling has been developed to help isolate problems and aid debugging:
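Separately from that tooling, the periodic task described above could be sketched roughly as follows. `collect_engine_reports` is a hypothetical helper, not part of Juju. It assumes the introspection script sits at the same path on machines as in the Kubernetes example, and it only captures whatever agent `juju_engine_report` reports on by default for each machine.

```shell
# Sketch: save one juju_engine_report per machine for later comparison.
# collect_engine_reports is a hypothetical helper, not part of Juju.
# Usage: collect_engine_reports <model> <output-dir> <machine-id>...
collect_engine_reports() {
    model="$1"
    outdir="$2"
    shift 2
    mkdir -p "$outdir"

    for machine in "$@"; do
        # Container IDs such as 0/lxd/1 contain slashes; flatten them
        # so they can be used in a filename.
        safe_id=$(printf '%s' "$machine" | tr '/' '-')

        # Assumption: the introspection helpers live at this path and must
        # be sourced explicitly because the remote shell is not a login shell.
        juju ssh -m "$model" "$machine" \
            '. /etc/profile.d/juju-introspection.sh && juju_engine_report' \
            > "$outdir/machine-$safe_id.txt" < /dev/null \
            || echo "warning: no report from machine $machine" >&2
    done
}
```

Run from cron or a systemd timer, for example as `collect_engine_reports default /var/tmp/engine-reports 0 1 2`, this leaves one timestamp-free snapshot per machine; keeping dated output directories would make it easier to compare reports over time.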

szeestraten · 17 February 2020 16:01

I’d like to second @soumplis’s comment here: from our point of view, the RBAC and auth options are lacking. Regarding the MAAS side of things, it advertises the following features:

Authentication and Identity
Integrate with LDAP, Active Directory or SAML for central identity management and single-sign-on across multiple MAAS regions.

in addition to RBAC if you pay for support. However, there are no docs on what those features are or how they actually work. See some old posts in their discourse (1, 2) asking for details, but without much response. There is a blog post about multi-tenancy in MAAS which talks a little about the RBAC features, but that is really not enough to go on. It also talks about using Candid for LDAP.
Why no documentation? Put simply, I won’t buy unless I know what I am buying.

Speaking of Candid, what is the state of that project and what is its scope both for MAAS and Juju? I understand that it can be an external identity provider for Juju from some of the posts above, but where is the documentation?

sssler-scania · 1 April 2020 09:17

These are my experiences from running a production-grade installation for about a year, 2019-2020. I expect a lot of good things this year.

== Some general experiences from 2019-2020 ==

PROS (our experiences)

Juju has great stability.
Upgrades of controller infrastructure have been super.
Good support from Canonical.
Fantastic community support.
Fast moving development.
Brings in “Infrastructure as Code” in a perfect way.
Adds tremendous efficiency in deploying complex stacks. For example SLURM that is our largest use-case at the moment.

CONS (our experiences)

Error messages from juju are confusing for end users and seldom offer any help on how to deal with them. It’s a problem, since it affects the impression of juju in general. Presently, only introduce juju to hard-core devops who can manage this situation.
Bad support for centos (or any other distro than ubuntu), which we need.
Problematic to use vsphere in a multi-user setup, since tenant isolation is not fully understood by us. Perhaps it’s possible with later versions…
Very fragmented documentation.
The charm store and GUI are not working well yet - I know there are large changes coming in.
Writing charms is difficult and very few “best practices” exist, i.e. you need experienced developers to develop complex charms.
There is no quality assurance process for charms, which makes it difficult to know if it’s you, or the charm, that is the problem when your “juju deploy foobar” fails.

== Our core setup of Juju infrastructure ==

Model: controller
This is the “core” infra controller service, with no integrations. We keep this controller sacred, backed up, secured and with very restricted access.

3x HA juju controller.
1x prometheus2 for monitoring.
1x ntp subordinate for all units.

This core controller provides the foundation to the rest of the juju infrastructure which lives in this controller as models:

Model: candid
Provides authentication for the other controllers.

3x HA with ActiveDirectory backend
2x haproxy letsencrypt
1x prometheus2.
1x ntp subordinate for all units.

Model: jimm
Provides the juju client endpoint to jimm.foobar.com. Very much like jaas, but private to us.
Clients do “juju login jimm.foobar.com”, use their ActiveDirectory password and are off to the races.

3x jimm units
2x haproxy letsencrypt
1x prometheus2 for monitoring.
1x ntp subordinate for all units.

Cloud controllers
We run three models with separate juju controllers for cloud substrates: MAAS, vsphere and AWS.

They all run as:
3x HA juju controllers
2x ha-proxy with letsencrypt.
1x prometheus2 for monitoring.
1x ntp subordinate for all units.

=== Availability ===
So far, we have a 100% uptime on the service at large.

=== Performance ===
So far, we have not experienced any performance problems. We scaled our controllers fairly high, which is probably a good idea. The mongodb consumes a lot of RAM (16?), so I expect performance to be something we need to work on. Lately, juju add-model has started to take some time (20sec +) on the maas controller. But I think this is just normal everyday work that has to be done.

=== Complexity ===
Running juju is non-trivial. An “operator handbook” is much needed, since building up experience in all these new technologies is difficult. If you intend to bring up an enterprise-grade installation of juju, make sure to bring in some help.

=== Value ===
We are starting to derive value from juju through a few very hard-to-get properties in our infrastructure:

Increased efficiency for each juju-capable engineer (devops) by orders of magnitude. Yes, this should not be underestimated and makes the effort worthwhile.
We are able to equip our “operational staff” with “actions” which produce deterministic outcomes from the execution of operational tasks. That lowers the error rate, increases precision, and can also be added to automation workflows later on.
We have benefited from improved collaboration between teams with related, but separate challenges. Juju gives us a way to speak the same language around different topics, without having to throw away much of the time already invested in existing tools. It’s not destructive to already created value, so to say.

If you need any advice, just feel free to reach out.

zicklag · 1 April 2020 14:58

That’s a great breakdown. Thanks for that @sssler-scania!

timClicks · 10 May 2020 21:16

Related thread

erik-lonroth · 2 February 2022 12:35

How’s it going with this charm?

zicklag · 2 February 2022 19:59

Hey @erik-lonroth, I haven’t used it in a while as the project it was used for no longer uses Juju, but it was deployed and working well for a time and can be found here:

There just won’t be much in the way of support for it because we are no longer using it.

erik-lonroth · 2 February 2022 20:01

Too bad. I’m looking for something that would be possible to use at least for some time.

May I ask what made you abandon the juju path?

zicklag · 2 February 2022 20:13

It was my colleague who was actually using Juju, but in the scenario of the project he was working on, it made more sense to use Kubernetes just to make the application deployment match the rest of the similar apps that were already being deployed in Kubernetes.

Not that we have any love for Kubernetes, but if he was going to have to use Kubernetes for everything else, it didn’t make sense to introduce another tool as well.