Upgrade Series Feature Development

The “Upgrade Series” feature is nearing general availability. If you would like to test it while it is still in development, the following will get you started:

  • Ensure you are working at the tip of the development branch. The edge Snap can be installed via:

$ snap install --edge juju

  • Deploy a charm which implements the pre-series-upgrade and post-series-upgrade hooks. You can deploy cs:~ecjones/ubuntu-series-upgrade-0, which contains a minimal implementation of these hooks. Be sure to specify a series from which you can upgrade (naturally, you won’t be able to upgrade from the latest supported series):

juju deploy cs:~ecjones/ubuntu-series-upgrade-0 --series xenial

  • Run the prepare command

juju upgrade-series prepare <machine> <series>
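
For example, assuming the xenial machine deployed above is machine 0 and the target series is bionic:

juju upgrade-series prepare 0 bionic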

At this point you can run juju status and watch the pre-series-upgrade hook run. You should also see the unit agents shut down (their init system services are stopped). Currently this is not very pretty from a juju status perspective, since juju status reports only that it has lost connection to the application units and nothing more.
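
One way to follow the hook as it runs, assuming watch(1) is available:

$ watch --color -n 5 juju status --color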

  • Once the hook completes and the units are shut down (lost connection), try the complete command

juju upgrade-series complete <machine>
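
For example, for machine 0:

juju upgrade-series complete 0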

juju status should reveal that the units have started back up and re-established connectivity.

  • Both hooks should have run to completion without issue

The test charm above implements the pre-series-upgrade and post-series-upgrade hooks. Currently the hooks are implemented to idle (sleep) for 2 minutes. This will give you a chance to inspect the status of the running application to ensure that everything is functioning as you would expect.
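
For reference, a minimal sketch of what such a hook might look like, assuming a plain bash hook (this approximates the test charm’s behaviour rather than reproducing its actual source; juju-log and status-set are standard charm hook tools):

#!/bin/bash
# hooks/pre-series-upgrade (post-series-upgrade can mirror this)
set -e
juju-log "pre-series-upgrade: idling so the operator can inspect status"
status-set maintenance "pre-series-upgrade running"
sleep 120  # idle for 2 minutes, matching the test charm's behaviour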

If I understand correctly, any interactions with <machine> between prepare and complete steps should be rejected?

@veebers, your thinking would be correct but the actual prevention of machine operations during the upgrade process has not yet been implemented.

Hi,

I have a few questions about upgrade-series.

  • Will upgrade-series work on a charm that doesn’t implement those hooks?
  • Will the juju agent take care of running “do-release-upgrade”?

What’s the behavior for a machine that has more than one application running on it? Are the *-series-upgrade hooks executed in a random order?

Thanks,

Yes, the feature will work for a charm that does not implement the hooks. They just won’t be executed; Juju will still update the agents and so on, then hand the machine to the user to perform the manual do-release-upgrade steps as usual.

The agent will not take care of do-release-upgrade. The feature is designed for that to be handled by the operator, at the request of the stakeholders. do-release-upgrade can ask questions about config file changes and more, and automating that doesn’t seem smooth enough to rely on Juju to JFDI. It can be scripted with juju run --machine, so if it really can be done smoothly, the operator can script that part away.
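
For illustration, a non-interactive pass might be scripted roughly as follows (the frontend choice here is an assumption and may not be safe for every workload):

juju run --machine 0 --timeout=30m "sudo do-release-upgrade -f DistUpgradeViewNonInteractive"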

When there’s more than one unit per machine, the hooks will be triggered asynchronously and waited on. In the final version of the feature the command will be interactive and you’ll see each step progress, including the status of each unit’s hook execution. Once the hooks complete successfully and Juju is ready, it hands control to the user to perform any manual/scripted steps with juju run; the user then hands control back to Juju with the upgrade-series complete command.

Heads up that @externalreality landed a change today that cleans up the lock file after you run complete, so you are able to rerun the prepare/complete process additional times (provided things go smoothly/correctly). Since you don’t currently need to actually change the series, this allows for quick triggering of the pre/post hooks for the charms.

This should be in the devel snap in a couple of hours.

The reactive charms based on layer-basic are currently failing after machine reboot (when the juju agent should still be disabled) due to the bug of the config-changed hook firing prior to post-series-upgrade. Is there any update on the juju fix to stop hooks firing on reboot?

Our understanding is that this was corrected by this change during the sprint:

https://github.com/juju/juju/commit/6aff096f2af23742a6cbf47acbeebe52f85f805f

It’s part of the latest dev snap. Can you verify the snap version used, confirm that a fresh bootstrap was used, and check whether you’re still hitting the issue?

Thanks

I’m not sure what you mean by “any interactions”. It is intended that the machine goes into a “manual” mode at that point, and that unit agents on the machine stop engaging (they don’t notice config/relation changes, etc.).

However, since the intent is that the operator is then responsible for doing any actual upgrade steps, things like “juju run --machine X” should still work, as should “juju ssh X”, etc.

Update:

Liam and I are independently seeing the following with the most up-to-date edge snap version:

After reboot, hooks are held off from firing, as desired.

However, after the juju upgrade-series complete command, the first hook to execute is not guaranteed to be post-series-upgrade. We consistently see leader-settings-changed and config-changed execute before post-series-upgrade.

This is most significant for reactive charms, which need the opportunity to re-create their Python virtual environments. When any hook other than post-series-upgrade executes first, it will fail due to the venv being out of date.

Thanks, we’re looking into it and will get an update out as soon as we can.

One more issue I see occasionally is: txn-queue for $MACHINE_ID in “machines” has too many transactions (1001)

Example:

machine-10 complete phase started
machine-10 starting all unit agents after series upgrade
ceph-osd/0 post-series-upgrade hook running
ceph-osd/0 post-series-upgrade completed
neutron-openvswitch/2 post-series-upgrade hook running
neutron-openvswitch/2 post-series-upgrade completed
nova-compute/0 post-series-upgrade hook running
ERROR txn-queue for 4fe90b0a-d454-46c1-8579-cfb2d6e11476:10 in “machines” has too many transactions (1001)

Once that occurs, no operations against that machine will work, including removing it by force:

juju remove-machine 10 --force

removing machine 10 failed: failed to run transaction: []txn.Op{
    {
        C:      "machines",
        Id:     "4fe90b0a-d454-46c1-8579-cfb2d6e11476:10",
        Assert: bson.D{
            {
                Name:  "jobs",
                Value: bson.D{
                    {
                        Name:  "$nin",
                        Value: []state.MachineJob{2},
                    },
                },
            },
        },
        Insert: nil,
        Update: nil,
        Remove: false,
    },
    {
        C:      "cleanups",
        Id:     "5bae4b0d16abcc10eec88b57",
        Assert: nil,
        Insert: &state.cleanupDoc{
            DocID:  "5bae4b0d16abcc10eec88b57",
            Kind:   "machine",
            Prefix: "10",
            Args:   nil,
        },
        Update: nil,
        Remove: false,
    },
}: txn-queue for 4fe90b0a-d454-46c1-8579-cfb2d6e11476:10 in "machines" has too many transactions (1001)

I am trying to be as responsive as possible. Let me know if this is too much noise.

Just tested 2.5-beta1+develop-03d5fc8 and we seem to have regressed. Note that post-reboot, but before the post-series-upgrade hook, leader-settings-changed and config-changed executed.

28 Sep 2018 11:03:03-07:00 juju-unit executing running pre-series-upgrade hook
28 Sep 2018 11:13:00-07:00 juju-unit idle <---- *** Reboot happened here ***
28 Sep 2018 11:13:19-07:00 juju-unit executing running leader-settings-changed hook
28 Sep 2018 11:13:29-07:00 juju-unit executing running config-changed hook
28 Sep 2018 11:13:38-07:00 workload blocked Ready for do-release-upgrade and reboot. Set complete when finished.
28 Sep 2018 11:13:38-07:00 juju-unit executing running post-series-upgrade hook
28 Sep 2018 11:13:44-07:00 juju-unit idle

Appreciate it. The change in question got hung up in the landing process and only now hit trunk. I’m working on getting the snap builds manually forced through. I shouldn’t have reached out about the PR with the fix until the snap had been built.

Initial tests with the newest snap look promising. More data to come.

That generally means a txn is broken, preventing other txns from being run. Usually you need to run “mgopurge” with the controllers shut down in order to fix the broken txn before you can make other changes to the document.
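
As a rough sketch of that recovery, assuming a single controller at machine 0 (service names vary with the machine ID, and mgopurge is a separate tool from github.com/juju/mgopurge):

juju ssh -m controller 0
sudo systemctl stop jujud-machine-0    # stop the controller agent first
sudo ./mgopurge                        # repair broken txns in MongoDB
sudo systemctl start jujud-machine-0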

It would be good to understand what other txns are listed in the machine record, in case something is wrong with the new code causing the corruption.

Saw the too-many-transactions problem again on today’s snap.

juju remove-machine --force 12
removing machine 12 failed: failed to run transaction: []txn.Op{
    {
        C:      "machines",
        Id:     "2b82c210-cb6f-4f9b-8552-b234fc707068:12",
        Assert: bson.D{
            {
                Name:  "jobs",
                Value: bson.D{
                    {
                        Name:  "$nin",
                        Value: []state.MachineJob{2},
                    },
                },
            },
        },
        Insert: nil,
        Update: nil,
        Remove: false,
    },
    {
        C:      "cleanups",
        Id:     "5bb7cb2debf2fa10c9816832",
        Assert: nil,
        Insert: &state.cleanupDoc{
            DocID:  "5bb7cb2debf2fa10c9816832",
            Kind:   "machine",
            Prefix: "12",
            Args:   nil,
        },
        Update: nil,
        Remove: false,
    },
}: txn-queue for 2b82c210-cb6f-4f9b-8552-b234fc707068:12 in "machines" has too many transactions (1001)

Is it possible to get access to the controller? While we know which txn was “one too many”, it would be good to know what the other 1000 txns on that machine were. Likely something else is broken and this is just the visible fallout.

The following changes are landed in edge and should be available in the latest Snap build shortly:

  • The unit agents are no longer shut down prior to re-writing their service files.
  • Prior to commencing the prepare phase, applications represented by units on the machine have their leadership “frozen”.
  • After the complete phase has run, applications have their leadership “unfrozen”.

Some bug-fixes and enhancements are available in the edge Snap:

  1. The “too many transactions” issue has been addressed by changes both to Mongo transaction handling and to logic around the upgrade-series machine lock.

  2. Previously, if two machines were prepared and those machines shared units of a common application, completing one machine would unpin leadership, despite there still being a prepared machine with units of the same application. This has been rectified: normal leadership expiry for any given application is only restored when the last vested machine asks to unpin leadership.

  3. Pinning and unpinning application leaders no longer happens in the client. The responsibility is now handled by the upgrade-series worker on the machine. This means that CLI feedback for each pin/unpin operation no longer occurs; it appears in the machine logs instead. A list of the applications that will be pinned is reported when running the prepare command.

  4. CLI feedback wording is changed slightly, and there is no message regarding units and pinned applications when the machine has no units running.

  5. Upgrade-series commands are prevented from running on controllers.

  6. It is no longer necessary to set the “upgrade-series” feature flag in order to use this functionality.