This has been an ongoing problem in the acceptance tests: we have a tendency to time out when tearing down a test. It happens across different providers, so it isn't one problem child, and when we look into the teardown slowness the cause isn't always the same. Recently we've also had issues tearing down a controller/machine that had nothing to do with the acceptance tests, but was a real bug in Juju. It was caught by CI tests early on, but because of the cry-wolf effect of our current setup it was ignored.
I have been thinking about what to do about this issue for some time, and I was originally a big fan of measuring the success of an acceptance test purely on the tests run inside it, not on the run as a whole. My original theory was that we, the developers who write the tests, only ever care about what our tests actually do. We shouldn't care about the test cleanup and the log parsing, as that is superfluous to the actual test; in my mind the question is simply: did my tests pass or not?
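To make that idea concrete, here is a minimal sketch of what "the verdict reflects only the test body" could look like. The `RunResult` type and `verdict` method are my own invention for illustration, not anything in the current harness:

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """Outcome of one acceptance-test run, splitting the verdict we care
    about (did the test body pass?) from housekeeping (did teardown work?)."""
    tests_passed: bool
    teardown_ok: bool

    def verdict(self) -> str:
        # The primary verdict reflects only the test body; a teardown
        # failure is surfaced separately so it can be tracked without
        # masking an otherwise green run.
        if not self.tests_passed:
            return "FAIL"
        return "PASS" if self.teardown_ok else "PASS (teardown failed)"


print(RunResult(tests_passed=True, teardown_ok=False).verdict())
# → PASS (teardown failed)
```

The point of the split is that teardown flakiness becomes a trackable signal of its own rather than noise that makes people ignore red runs.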
I brought this up with @veebers in the last sprint. We have a bit of a gap in the way the tests have been designed, in that there aren't many dedicated teardown tests (one is being written and added currently). With Juju deploying such a wide variety of scenarios to various providers, it would be difficult to write tests covering all of them, so the teardown phase of every test provides that security blanket for us. Honestly, I'm unsure whether that's right or wrong! Initially it felt wrong to me, but I've changed my mind recently: the design of the tests also ensures that we can bootstrap and tear down anything we throw at them, and that has to be a good thing?

I'm also not a fan of blowing away the containers once the tests have run and skipping the teardown altogether, for the very reason that we would then have to garbage-collect all the instances. I know we already do that for problem/stuck instances, but I personally don't think it's resilient across all the tests we run. It could end up costing a small fortune, as stragglers might keep running for some time and those instances might not be cheap.
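On the garbage-collection concern, any reaper needs an age policy and a reliable view of what is running. This is a hypothetical sketch, with the `MAX_AGE` threshold, the `stale_instances` helper, and the instance records all made up for illustration; a real implementation would query each provider's API rather than hard-code a fleet:

```python
from datetime import datetime, timedelta, timezone

# Assumed policy: any CI instance older than this is a straggler to reap.
MAX_AGE = timedelta(hours=4)


def stale_instances(instances, now=None):
    """Return the ids of instances that have outlived MAX_AGE.

    `instances` is an iterable of (instance_id, launched_at) pairs with
    timezone-aware launch times.
    """
    now = now or datetime.now(timezone.utc)
    return [iid for iid, launched in instances if now - launched > MAX_AGE]


now = datetime(2019, 1, 1, 12, 0, tzinfo=timezone.utc)
fleet = [
    ("juju-ci-01", now - timedelta(hours=6)),     # straggler, past MAX_AGE
    ("juju-ci-02", now - timedelta(minutes=30)),  # recent run, keep
]
print(stale_instances(fleet, now=now))  # → ['juju-ci-01']
```

Even with a reaper like this, anything between launch and reap is billed, which is exactly why relying on garbage collection instead of teardown feels fragile.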
So what do we do about teardown failing our tests? That I don't know, but what I do know is that we shouldn't keep increasing the default timeout, as that only highlights that there needs to be some thorough investigation into why the providers are slow at tearing down!
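As a cheap starting point for that investigation, we could record teardown durations per provider and flag the outliers, rather than bumping a global timeout. A sketch with made-up timings (a real harness would collect these from CI runs):

```python
import statistics

# Hypothetical teardown timings per provider, in seconds.
timings = {
    "aws": [41, 38, 44, 39],
    "azure": [120, 500, 95, 110],
    "lxd": [12, 11, 13, 12],
}

# Flag providers whose worst teardown is far above their median: that
# pattern suggests intermittent slowness worth investigating, not a
# timeout that simply needs raising.
for provider, samples in sorted(timings.items()):
    median = statistics.median(samples)
    worst = max(samples)
    if worst > 2 * median:
        print(f"{provider}: worst {worst}s vs median {median}s -- investigate")
```

A steady provider that is uniformly slow might justify a bigger (per-provider) timeout; a spiky one points at a bug.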
If anyone has any other ideas, I'm all ears, because for me the only way forward is to speed up Juju…?