RunUnittest-* general test failures

Comments from CI run morning of September 16

-amd64

github.com/juju/juju/worker/provisioner.TestPackage (from github.com_juju_juju_worker_provisioner)
FAIL: container_initialisation_test.go:182: ContainerSetupSuite.TestContainerProvisionerStarted
  • I believe @manadart is rewriting these tests; the rewrite should land in a day or two.

-amd64-bionic

=== RUN   TestPackage
2018-09-25 14:02:01 WARNING juju.testing mgo.go:212 failed to start mongo: mongod failed to listen on port 45518
--- FAIL: TestPackage (0.36s)
    mgo.go:469: mongod failed to listen on port 45518
FAIL
FAIL	github.com/juju/juju/cmd/juju/romulus/setplan	7.972s
  • This appears to be just mongod failing to start. It occasionally fails to come up and we don’t yet know why.

-centos7

	github.com/juju/juju/service/systemd/cmdline.go:158: error executing "/bin/systemctl": Failed to list unit files: Connection timed out; 
	github.com/juju/juju/service/systemd/cmdline.go:163: List failed (/bin/systemctl list-unit-files --no-legend --no-page -l -t service | grep -o -P '^\w[\S]*(?=\.service)')
	github.com/juju/juju/service/systemd/cmdline.go:112: 
	github.com/juju/juju/service/systemd/service.go:57: 
	github.com/juju/juju/service/service.go:168: failed to list systemd services
	github.com/juju/juju/container/lxd/initialisation_linux.go:456: 
	github.com/juju/juju/container/lxd/initialisation_linux.go:429: 
	github.com/juju/juju/container/lxd/initialisation_linux.go:110: 
	github.com/juju/juju/container/lxd/initialisation_linux.go:65: 
  • All of the failures were related to systemd not responding. The affected tests were attempting either to list the services or to communicate with LXD; a rough sketch of the listing call is below.
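
The chain above bottoms out in the shell pipeline quoted at cmdline.go:163. As a rough, hypothetical sketch (not the juju/service/systemd code itself), issuing that pipeline from Go looks something like the following; a systemd-side "Connection timed out" makes systemctl exit non-zero, which surfaces as an error here and then fails everything further up the call chain above.

package main

import (
	"context"
	"fmt"
	"os/exec"
	"strings"
	"time"
)

// The same pipeline seen in the failure output.
const listCmd = `/bin/systemctl list-unit-files --no-legend --no-page -l -t service` +
	` | grep -o -P '^\w[\S]*(?=\.service)'`

// listServices returns the systemd service unit names (without ".service"),
// giving up after a bounded wait so an unresponsive systemd cannot hang forever.
func listServices(parent context.Context) ([]string, error) {
	ctx, cancel := context.WithTimeout(parent, 30*time.Second)
	defer cancel()

	out, err := exec.CommandContext(ctx, "/bin/sh", "-c", listCmd).Output()
	if err != nil {
		return nil, fmt.Errorf("listing systemd services: %v", err)
	}
	return strings.Fields(string(out)), nil
}

func main() {
	names, err := listServices(context.Background())
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(len(names), "services")
}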

-ppc64le

FAIL: oplog_test.go:103: oplogSuite.TestStops

[LOG] 0:00.000 DEBUG juju.testing starting mongo in /tmp/juju-unittestfs/test-mgo806404599
[LOG] 0:00.000 DEBUG juju.testing using mongod at: "/usr/lib/juju/mongo3.2/bin/mongod" (version=3.2.15)
[LOG] 0:00.081 DEBUG juju.testing started mongod pid 140657 in /tmp/juju-unittestfs/test-mgo806404599 on port 35344
oplog_test.go:111:
    s.getNextOplog(c, tailer)
oplog_test.go:282:
    c.Fatal("timed out waiting for oplog doc")
... Error: timed out waiting for oplog doc

[LOG] 0:10.090 DEBUG juju.testing killing mongod pid 140657 in /tmp/juju-unittestfs/test-mgo806404599 on port 35344 with killed
  • We have seen this intermittently before. The code looks fine and we have no idea why the document fails to appear; a sketch of the wait pattern involved is below.
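
For context, the failure is a ten-second wait for the next oplog document that never arrives. Here is a minimal, hypothetical sketch of that wait pattern (the real test drives an oplog tailer; the channel below just stands in for its output):

package main

import (
	"errors"
	"fmt"
	"time"
)

// nextDoc waits for the next document from a tailer's output channel and
// gives up after the timeout, which is the shape of the failure above.
func nextDoc(out <-chan string, timeout time.Duration) (string, error) {
	select {
	case doc := <-out:
		return doc, nil
	case <-time.After(timeout):
		return "", errors.New("timed out waiting for oplog doc")
	}
}

func main() {
	out := make(chan string) // nothing is ever sent, mimicking the failure
	if _, err := nextDoc(out, 2*time.Second); err != nil {
		fmt.Println(err)
	}
}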

-race-amd64

  • Hit the container initialisation bug above
  • Also hit the data race in the resetting of the mgo stats (see the sketch below the trace):
WARNING: DATA RACE
Write at 0x0000051c5d88 by goroutine 107:
  github.com/juju/juju/vendor/gopkg.in/mgo%2ev2.SetStats()
      /workspace/_build/src/github.com/juju/juju/vendor/gopkg.in/mgo.v2/stats.go:40 +0xdf
  github.com/juju/juju/vendor/github.com/juju/testing.(*MgoSuite).SetUpSuite()
      /workspace/_build/src/github.com/juju/juju/vendor/github.com/juju/testing/mgo.go:493 +0xaf
  github.com/juju/juju/provider/dummy_test.(*liveSuite).SetUpSuite()
      /workspace/_build/src/github.com/juju/juju/provider/dummy/environs_test.go:61 +0x67
  runtime.call32()
      /snap/go/2635/src/runtime/asm_amd64.s:522 +0x3a
  reflect.Value.Call()
      /snap/go/2635/src/reflect/value.go:308 +0xc0
  github.com/juju/juju/vendor/gopkg.in/check%2ev1.(*suiteRunner).runFixture.func1()
      /workspace/_build/src/github.com/juju/juju/vendor/gopkg.in/check.v1/check.go:721 +0x172
  github.com/juju/juju/vendor/gopkg.in/check%2ev1.(*suiteRunner).forkCall.func1()
      /workspace/_build/src/github.com/juju/juju/vendor/gopkg.in/check.v1/check.go:666 +0x89

Previous read at 0x0000051c5d88 by goroutine 97:
  github.com/juju/juju/vendor/gopkg.in/mgo%2ev2.(*mongoCluster).Release()
      /workspace/_build/src/github.com/juju/juju/vendor/gopkg.in/mgo.v2/cluster.go:105 +0x1fc
  github.com/juju/juju/vendor/gopkg.in/mgo%2ev2.(*mongoCluster).syncServersLoop()
      /workspace/_build/src/github.com/juju/juju/vendor/gopkg.in/mgo.v2/cluster.go:363 +0x1bd
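
For reference, the shape of this race is a plain write to package-level stats state from suite setup while a background mgo goroutine is still reading it. This is a hypothetical sketch of that shape (not the vendored mgo code), with a mutex showing one way the write/read pair could be made safe:

package main

import "sync"

// Package-level stats state, standing in for mgo's stats variables.
var (
	statsMu      sync.Mutex
	statsEnabled bool
	clustersLive int
)

// SetStats mirrors the idea of mgo.SetStats(enabled bool): reset the stats.
// The vendored code does this write without synchronisation, which is what
// the race detector reports when a leftover cluster goroutine still reads it.
func SetStats(enabled bool) {
	statsMu.Lock()
	defer statsMu.Unlock()
	statsEnabled = enabled
	clustersLive = 0
}

// clusterReleased stands in for mongoCluster.Release updating the counters.
func clusterReleased() {
	statsMu.Lock()
	defer statsMu.Unlock()
	if statsEnabled {
		clustersLive--
	}
}

func main() {
	done := make(chan struct{})
	go func() { // like mgo's syncServersLoop, running beyond suite teardown
		for i := 0; i < 1000; i++ {
			clusterReleased()
		}
		close(done)
	}()
	SetStats(true) // suite SetUpSuite resetting stats concurrently
	<-done
}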

-s390x

  • Watcher race in state package: ActionSuite.TestUnitWatchActionNotifications
  • Watcher race in featuretests package: meterStatusIntegrationSuite.TestWatchMeterStatus
  • [thumper 2018-09-27] On further inspection, both of these tests have already been “fixed”, so we should add watcher logging to both of them to get more information the next time they fail.

Yes. ContainerProvisionerSuite is in flight and looking good.

@jameinel I was wondering if you’d be able to take a look at the oplog_test failure.

 FAIL: oplog_test.go:103: oplogSuite.TestStops

[LOG] 0:00.000 DEBUG juju.testing starting mongo in /tmp/test-mgo845878774
[LOG] 0:00.001 DEBUG juju.testing using mongod at: "/usr/lib/juju/mongo3.2/bin/mongod" (version=3.2.15)
[LOG] 0:00.110 DEBUG juju.testing started mongod pid 30506 in /tmp/test-mgo845878774 on port 40384
oplog_test.go:111:
    s.getNextOplog(c, tailer)
oplog_test.go:282:
    c.Fatal("timed out waiting for oplog doc")
... Error: timed out waiting for oplog doc

[LOG] 0:10.146 DEBUG juju.testing killing mongod pid 30506 in /tmp/test-mgo845878774 on port 40384 with killed

This happened again today in two different runs: once on arm64 and, I think, once on ppc64le. Both of those builders are still set to use xenial and run mongo 3.2 (as you can see from the output above).

I have looked at this one in the past without figuring it out, and today I had another go with @wallyworld, but neither of us could see where the failure is coming from.

I got briefly excited when I thought I had worked it out, but on further reading realised that wasn’t it. I had thought the constructor function should wait for the loop to fully initialise before returning, but reading the loop function showed there is no danger in it starting later.

I also got briefly excited thinking I had found a handle leak, only to find that it isn’t the case. I had thought that the deferred iter.Close() captured the value of iter in the closure, so the close and recreate further down the main loop would leak the new iterator. On testing, though, that isn’t the case: the closure captures the iter variable by reference, so reassigning iter later in the function is fine.
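
To illustrate the closure point with a standalone example (not the tailer code): a deferred closure sees the variable’s value at return time, whereas deferring the method call directly evaluates its receiver immediately.

package main

import "fmt"

type iterator struct{ name string }

func (i *iterator) Close() { fmt.Println("closing", i.name) }

func closureDefer() {
	iter := &iterator{name: "first"}
	defer func() { iter.Close() }() // captures the variable iter by reference
	iter.Close()                    // close and recreate, as the tailer loop does
	iter = &iterator{name: "second"}
	// The deferred closure prints "closing second": the latest iter is closed, no leak.
}

func valueDefer() {
	iter := &iterator{name: "first"}
	defer iter.Close() // the receiver is evaluated now, so this will close "first"
	iter.Close()
	iter = &iterator{name: "second"}
	// The deferred call prints "closing first" again; "second" would be left open.
}

func main() {
	closureDefer()
	valueDefer()
}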

So, long story short, we couldn’t figure it out, and perhaps fresh eyes would help.

I couldn’t get it to reproduce locally.

So I verified that we do apply the patches to 2.4 and earlier, to ensure that we’re not blocking any current releases. The issue we do have is that we no longer apply the patches in develop (2.5), and we should fix that. I’ve filed a bug (LP #1795007) to track it.