With the landing of PR 9116 (Use StatePool for initial database connections), the behaviour of the state transaction watcher worker changed for the controller model.
Back in the dawn of time, every State object had its own transaction watcher that polled the database every five seconds to look for changes. This was fine when there was only ever one model, but with Juju 2.0 this changed. Initially developers didn’t notice the load because it wasn’t really a problem in most situations. However, the key problem that pushed this change was a ProdStack problem where the API servers would go into a death spiral when someone removed an application in a model. This was due to the 400 or so State objects waking up and scanning mongo for the changes. There are a lot of document changes when an application is removed, particularly if there are many agents. This led to i/o timeout errors from mongo, amongst other things.
Initial watcher work
Back in early Juju 2.3 a change was made to introduce a different type of transaction watcher. This one was owned by the StatePool and read the changes from mongo in a much more adaptive manner. It started by polling very frequently (every 10ms) and backed off, if there were no changes, to a maximum of five seconds. The worst-case poll interval therefore matched the existing behaviour, and watchers would be notified much closer to the time of the document changing. This transaction watcher is found in state/watcher/txnwatcher.go. It polls mongo to read the changes to the transaction log collection, then collates the changes and publishes the events on a SimpleHub that is owned by the StatePool. State objects that are created by the StatePool have a HubWatcher as their watcher worker.
Now back to PR 9116.
This change makes the StatePool the primary object that is opened to connect to the database rather than a single State object. This means that the State instance for the controller model is now getting changes more often than before.
The original transaction watcher and the new txnwatcher both coalesce multiple changes to a single document into a single notification. However, the original would wait for five seconds and then coalesce changes, whereas the new one coalesces events over a much shorter timeframe.
This means that multiple changes to a document that previously would have come through as a single notification may now come through as multiple notifications. To be clear, if the code cares about this, it has always had a bug; we are just surfacing it.
Test Suite changes
If you have a test that is interacting with Mongo, you’ll most likely be using one of two base suites:
JujuConnSuite - this is the one we tell people to stop using
StateSuite - from the
JujuConnSuite does many things. The key point as far as watchers are concerned is that it is a wall-clock-based test suite. That means that the StatePool is created with clock.WallClock. This clock is then also passed in to each of the State objects and the workers for them.
Watchers will be triggered automatically and the State.StartSync method isn’t needed any more to prod the underlying mongo transaction watcher.
StateSuite uses a TestClock. This means that time doesn’t advance unless you tell it to, and that includes the transaction polling worker. The StartSync method on the State object has been updated to advance the clock by a second if the clock can be advanced (wall clocks can’t).
This does mean that we can control with a lot of certainty when changes to the documents will be coalesced and when they won’t, by advancing the clock between changes.
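The idea of steering coalescing from a test can be shown with a toy model. Nothing below is Juju's API: manualClock, fakeWatcher, and the one second poll interval are hypothetical stand-ins for a test clock and the transaction polling worker, written only to show why advancing the clock between changes separates the notifications.

```go
package main

import (
	"fmt"
	"time"
)

// manualClock is a hypothetical test clock: time moves only on Advance.
type manualClock struct {
	now time.Duration
}

func (c *manualClock) Advance(d time.Duration) { c.now += d }

// fakeWatcher is a hypothetical poller that delivers at most one
// notification per document each time a poll is due.
type fakeWatcher struct {
	clock     *manualClock
	pollEvery time.Duration
	lastPoll  time.Duration
	pending   map[string]bool
	delivered []string
}

func newFakeWatcher(c *manualClock) *fakeWatcher {
	return &fakeWatcher{clock: c, pollEvery: time.Second, pending: map[string]bool{}}
}

// Change records that a document changed; repeats before a poll coalesce.
func (w *fakeWatcher) Change(docID string) { w.pending[docID] = true }

// Sync delivers pending notifications if the clock says a poll is due.
func (w *fakeWatcher) Sync() {
	if w.clock.now-w.lastPoll < w.pollEvery {
		return
	}
	w.lastPoll = w.clock.now
	for id := range w.pending {
		w.delivered = append(w.delivered, id)
		delete(w.pending, id)
	}
}

// coalesced makes two changes without advancing the clock in between.
func coalesced() int {
	clock := &manualClock{}
	w := newFakeWatcher(clock)
	w.Change("app/0")
	w.Change("app/0")
	clock.Advance(time.Second)
	w.Sync()
	return len(w.delivered)
}

// separate advances the clock and syncs between the two changes.
func separate() int {
	clock := &manualClock{}
	w := newFakeWatcher(clock)
	w.Change("app/0")
	clock.Advance(time.Second)
	w.Sync()
	w.Change("app/0")
	clock.Advance(time.Second)
	w.Sync()
	return len(w.delivered)
}

func main() {
	fmt.Println("no advance between changes:", coalesced())
	fmt.Println("advance between changes:   ", separate())
}
```

In this model the first scenario yields one notification and the second yields two, which mirrors the choice a StateSuite test makes when it decides whether to advance the clock between changes.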