Stable controller startup under heavy agent load

When the controller manages many agents, it can have trouble starting up: many of its own workers need to connect to it over the API, and at the same time every other agent is also trying to connect to the controllers.

The problem here is how to let the controller start and get all of its workers stable before opening the flood-gates to the external agents.

A key constraint is that the controller really should not even accept connections from external agents before it has established that it is stable.

A side consideration for any change here is that we need a callback from the httpserver worker so it can report the port number (or numbers) it is actually listening on for API connections. This would fix the race condition we currently hit in some of the full stack agent tests: a test opens a port just to get a free number, closes it, and passes that number in as config for the machine agent to use, only to find that another test has grabbed it in the meantime.
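To make this concrete, here is a rough sketch of what I mean (the PortReporter name and the callback wiring are illustrative, not an existing API): bind the listener inside the worker and report the port the kernel actually assigned, so tests can ask for ":0" and never race.

package httpserver

import "net"

// PortReporter is a hypothetical callback the httpserver worker could be
// handed so it can report the port it actually bound. Tests could then pass
// a listen address of ":0" and read the real port back, instead of
// pre-allocating a port number and hoping nobody else grabs it.
type PortReporter func(apiPort int)

func openListener(listenAddr string, report PortReporter) (net.Listener, error) {
    listener, err := net.Listen("tcp", listenAddr)
    if err != nil {
        return nil, err
    }
    if report != nil {
        // The kernel has chosen the port by now, so there is no window in
        // which another process (or test) can steal it.
        report(listener.Addr().(*net.TCPAddr).Port)
    }
    return listener, nil
}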

Proposal

Based on conversations with @jameinel I think the best approach is to have a second API server port that is used just for the controller to communicate with itself, and for controller to controller traffic.

I had been trying to work out how to shoe-horn this into the current system with minimal impact, and I think I have found the best place.

In the main loop function of worker/httpserver we create the main listener with this:

listener, err := net.Listen("tcp", listenAddr)

What I think we need here is a custom type.

This type needs to implement net.Listener and should initially listen only on the controller_api port. It should also have another method that enables listening on the normal api port. The Accept method needs to coalesce the Accept calls from both ports.
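Something along these lines is what I have in mind. This is very much a sketch: the names are made up, error handling is thin, and the real thing would live in worker/httpserver.

package httpserver

import (
    "errors"
    "net"
    "sync"
)

// dualListener implements net.Listener. It starts out accepting only on the
// controller_api port; OpenAPIPort later adds the normal api port, and
// Accept coalesces connections from whichever ports are open.
type dualListener struct {
    conns     chan net.Conn
    done      chan struct{}
    addr      net.Addr
    apiAddr   string
    mu        sync.Mutex
    listeners []net.Listener
    openOnce  sync.Once
    closeOnce sync.Once
}

func newDualListener(controllerAddr, apiAddr string) (*dualListener, error) {
    controller, err := net.Listen("tcp", controllerAddr)
    if err != nil {
        return nil, err
    }
    dl := &dualListener{
        conns:     make(chan net.Conn),
        done:      make(chan struct{}),
        addr:      controller.Addr(),
        apiAddr:   apiAddr,
        listeners: []net.Listener{controller},
    }
    go dl.serve(controller)
    return dl, nil
}

// OpenAPIPort starts accepting connections on the normal api port as well.
// It is safe to call more than once.
func (dl *dualListener) OpenAPIPort() error {
    var err error
    dl.openOnce.Do(func() {
        var api net.Listener
        if api, err = net.Listen("tcp", dl.apiAddr); err != nil {
            return
        }
        dl.mu.Lock()
        dl.listeners = append(dl.listeners, api)
        dl.mu.Unlock()
        go dl.serve(api)
    })
    return err
}

// serve funnels connections accepted on one underlying listener into the
// shared channel that Accept reads from.
func (dl *dualListener) serve(l net.Listener) {
    for {
        conn, err := l.Accept()
        if err != nil {
            return
        }
        select {
        case dl.conns <- conn:
        case <-dl.done:
            conn.Close()
            return
        }
    }
}

// Accept returns the next connection from whichever port produced one.
func (dl *dualListener) Accept() (net.Conn, error) {
    select {
    case conn := <-dl.conns:
        return conn, nil
    case <-dl.done:
        return nil, errors.New("listener has been closed")
    }
}

// Close shuts down both underlying listeners (if open).
func (dl *dualListener) Close() error {
    dl.closeOnce.Do(func() {
        close(dl.done)
        dl.mu.Lock()
        defer dl.mu.Unlock()
        for _, l := range dl.listeners {
            l.Close()
        }
    })
    return nil
}

// Addr reports the controller port's address.
func (dl *dualListener) Addr() net.Addr { return dl.addr }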

The httpserver manifold should also expose, through its output func, an interface for opening the api port.
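For example, the output func could hand out a small interface like the one below. The APIPortOpener name, the *Worker type assertion and the use of juju/errors are assumptions for illustration, not the final shape:

// APIPortOpener is a hypothetical name for the interface the manifold would
// hand out: just enough for another worker to ask the listener to start
// accepting ordinary agent connections.
type APIPortOpener interface {
    OpenAPIPort() error
}

// manifoldOutput follows the usual dependency-engine output pattern: unpack
// the httpserver worker and expose only the APIPortOpener view of it.
func manifoldOutput(in worker.Worker, out interface{}) error {
    w, ok := in.(*Worker)
    if !ok {
        return errors.Errorf("expected *httpserver.Worker, got %T", in)
    }
    opener, ok := out.(*APIPortOpener)
    if !ok {
        return errors.Errorf("out should be *APIPortOpener, got %T", out)
    }
    *opener = w
    return nil
}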

The code in agent/agent.go already knows whether or not to use localhost as the address, and it would need to be updated to use the controller port when connecting to localhost.
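Roughly like this (UseLocalhost, APIPort and ControllerAPIPort are placeholder accessors, not the real agent config API):

// Config stands in for the real agent config interface.
type Config interface {
    APIPort() int
    ControllerAPIPort() int
    UseLocalhost() bool
}

// apiAddress picks the port to dial: when we are connecting to ourselves
// over localhost and a controller port is configured, use it instead of the
// normal api port.
func apiAddress(cfg Config) string {
    port := cfg.APIPort()
    if cfg.UseLocalhost() && cfg.ControllerAPIPort() != 0 {
        port = cfg.ControllerAPIPort()
    }
    return net.JoinHostPort("localhost", strconv.Itoa(port))
}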

The primary hard piece then is deciding when the apiserver tells the new listener type to open the other port. I think it should be once the peer grouper has sent its initial message, as this indicates that the api connection is running and the peer grouper has determined who should be up and running.
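As a sketch of the trigger, using the APIPortOpener from above and a plain channel standing in for "the peer grouper has published its initial message" (the real signal would come off the central hub):

import "log"

// openWhenStable waits for the peer grouper's initial message and then tells
// the listener to start accepting ordinary agent connections.
func openWhenStable(published <-chan struct{}, opener APIPortOpener) {
    <-published
    if err := opener.OpenAPIPort(); err != nil {
        log.Printf("failed to open the api port: %v", err)
    }
}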

This new listener type is also the key place to rate limit the Accept calls. We should rate limit only the api port and not the controller_api port.
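A simple way to do that is to wrap only the underlying api listener before its connections are funnelled into the coalescing Accept, something like this (the pause value would be a placeholder for whatever we tune it to):

package httpserver

import (
    "net"
    "time"
)

// rateLimitedListener enforces a minimum pause between accepted connections,
// so a thundering herd of agents cannot starve the controller at startup.
// Only the public api listener would be wrapped like this; the controller
// port stays unthrottled.
type rateLimitedListener struct {
    net.Listener
    minPause time.Duration
    last     time.Time
}

func (l *rateLimitedListener) Accept() (net.Conn, error) {
    if wait := l.minPause - time.Since(l.last); wait > 0 {
        time.Sleep(wait)
    }
    conn, err := l.Listener.Accept()
    l.last = time.Now()
    return conn, err
}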


I like the idea of listening on a different port for connecting to itself. The one major caveat for why we might not want to use it for Controllers to connect to each other is that it requires opening up a new port in firewalls. I would support allowing it, and even recommending it, but I'd definitely want a way to disable it so that post-upgrade we don't break inter-controller communication especially in sites where we don't control the firewall.

It isn't uncommon for web libraries to make it easy to poll multiple sockets, so we could provide our own layer, but I'd be curious whether we'd prefer to use something that already exists.

I definitely like this for "connecting to self" and not allowing third parties to connect until we see the self connection established.

If we were concerned about the extra port, I would even consider an Accept that rejects everything that isn't local until we get a local connection, but I think this is ultimately better.

I'm trying a quick attempt at an Accept that rejects non-controller connections as a first pass at this.

What is the key for when we allow other connections? Once we reach a quorum of controllers?

I'm initially going to try the peer grouper published event plus a small delay :)

OK, results are in for the simple test. The first attempt was a wrapper around the Accept call for the standard API port that rejected connections from non-apiserver machines.

The first variant was to close the connection and return an error instead. However, that kills the http.Serve call. Looking into the http package, it handles temporary net errors with an exponential backoff, but we don't want to trigger that either, as every rejected connection would increase the delay before a legitimate connection could get through.

The second attempt was to close the connection first and then return it anyway. This caused a lot of TLS errors, as the server tried to write to a closed connection, and there were rapid reconnections; it appears that the remote API connection has no backoff on accept failures.
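For reference, the wrapper used in both attempts looked roughly like this (simplified; the allowed check stands in for the real "is this an apiserver machine" test):

package httpserver

import "net"

// restrictedListener wraps the standard api listener and only lets
// controller machines through. Returning an error from Accept (the first
// variant) aborts http.Serve unless the error reports itself as temporary,
// and temporary errors feed http.Serve's exponential backoff; closing the
// connection and returning it anyway (the second variant) is what produced
// the TLS errors and rapid reconnections described above.
type restrictedListener struct {
    net.Listener
    allowed func(remote net.Addr) bool
}

func (l *restrictedListener) Accept() (net.Conn, error) {
    conn, err := l.Listener.Accept()
    if err != nil {
        return nil, err
    }
    if !l.allowed(conn.RemoteAddr()) {
        conn.Close()
        return conn, nil // second variant: hand back an already-closed conn
    }
    return conn, nil
}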

Based on these results, I think we need to look at a different solution.

One recommendation was to use an abstract domain socket for the juju controller, as this avoids using extra TCP ports. I'll try this next. It will also need a change to the pubsub worker, which gets its connection details from the AgentConfig; that is also where the connection details are overridden to be localhost, and would now become the domain socket, so we'll need to get the API port number a different way.

After more investigation, an abstract domain socket isn't going to work: it requires net.Dial("unix", "…"), and the gorilla websocket dialer only does "tcp".

I think the only sane approach now is to have an additional optional controller config value that, if set, means the controller will do the two-phase startup. If it is not set, behaviour continues as today. This means we aren't requiring any additional ports on upgrade, but still allow the controllers to be configured with an additional port.
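The gate itself would be small, something like this sketch (controllerAPIPort would come from the new optional config value, with zero meaning unset, and newDualListener is the coalescing listener sketched earlier):

package httpserver

import (
    "net"
    "strconv"
)

// newListener chooses between today's single-port behaviour and the
// two-phase startup, based on whether the optional controller port is set.
func newListener(apiPort, controllerAPIPort int) (net.Listener, error) {
    if controllerAPIPort != 0 {
        return newDualListener(
            net.JoinHostPort("", strconv.Itoa(controllerAPIPort)),
            net.JoinHostPort("", strconv.Itoa(apiPort)),
        )
    }
    return net.Listen("tcp", net.JoinHostPort("", strconv.Itoa(apiPort)))
}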

Sounds like a good plan. I'm a bit surprised we couldn't reject incoming connections in the Accept call, but I guess breaking http.Serve is a problem.

That said, I do think we should add connection backoff on the client side. We didn't do it originally because we felt you can't trust clients to do the right thing, and by doing 'we won't reject you until 5s later' we were implementing backoff server side.

That said, we can have all of our clients be gentle and still have the world be a better place.