Can't remove model, last application "waiting for machine"

Hi all

I have a model that’s been trying to destroy itself for a few days now. It hangs on “model not empty” because one application is in the state “waiting for machine”.

Is there a way to force-destroy this model?

model:
  name: cot-dev2
  type: iaas
  controller: vmware-main
  cloud: vmware1
  region: ILABT
  version: 2.3.3
  model-status:
    current: destroying
    message: 'attempt 1 to destroy model failed (will retry):  model not empty, found
      1 application (model not empty)'
    since: 19 Sep 2018 14:18:20+02:00
  meter-status:
    color: amber
    message: user verification pending
  sla: unsupported
machines: {}
applications:
  nginx-api-gateway:
    charm: local:xenial/nginx-api-gateway-0
    series: xenial
    os: ubuntu
    charm-origin: local
    charm-name: nginx-api-gateway
    charm-rev: 0
    exposed: false
    life: dying
    application-status:
      current: waiting
      message: waiting for machine
      since: 18 Feb 2018 21:56:24+01:00
    endpoint-bindings:
      upstream: ""
      website: ""
controller:
  timestamp: 10:50:08+02:00
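
For reference, the commands involved are just the standard ones; roughly what I have been running, with the controller and model names taken from the status above:

juju switch vmware-main
juju destroy-model cot-dev2                 # keeps retrying with "model not empty"
juju status -m cot-dev2 --format yaml       # produces the output above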

We have a bunch of models that are in this state. This might actually be the reason for the 40 Mb/s traffic.

It could well be.

For quite some time we have thought about having a destroy-model --force for those situations where things get stuck when they shouldn’t.

One of the underlying bugs here, I think, is that the application destruction shouldn’t be blocked by the “waiting for machine” state in this situation (and probably a few others).

This seems a lot like lp:1654928, which claims to be fixed. So maybe this is a different issue.

I believe we’ve talked about having a “remove-application --force” of some sort. That said, I think we can just fix this one. If an application has a unit that has no machine available for it, we should just allow that unit to be removed and the application killed.

The reason we want to remove some of these models is exactly because of this “stuck” application. We hoped removing the model would fix it. So remove-application --force would actually help us even more, since it means we don’t have to remove the model anymore.

Is there a workaround I can use to remove these applications and models currently, while waiting for the fix? I’d like to see if that reduces the network traffic.

I think it resembles this bug report more: Bug #1724673 “unable to destroy-model with offer in it”.

All the “stuck” applications have consumed or provided cross-model relations. Below is a YAML status of another model that fails to destroy; here you can still see some remnants of those cross-model relations.

juju status --format yaml
model:
  name: providenceplus
  type: iaas
  controller: vmware-main
  cloud: vmware1
  region: ILABT
  version: 2.3.8
  model-status:
    current: destroying
    message: 'attempt 39 to destroy model failed (will retry):  model not empty, found
      2 applications (model not empty)'
    since: 20 Sep 2018 10:31:00+02:00
  meter-status:
    color: amber
    message: user verification pending
  sla: unsupported
machines: {}
applications:
  kafka-rest:
    charm: local:xenial/kafka-rest-confluent-k8s-3
    series: xenial
    os: ubuntu
    charm-origin: local
    charm-name: kafka-rest-confluent-k8s
    charm-rev: 3
    exposed: false
    life: dying
    application-status:
      current: waiting
      message: waiting for machine
      since: 20 Jun 2018 09:34:30+02:00
    relations:
      kubernetes:
      - leggo
    endpoint-bindings:
      kafka: ""
      kubernetes: ""
      upstream: ""
  kafka-rest-k8s:
    charm: local:xenial/kafka-rest-confluent-k8s-0
    series: xenial
    os: ubuntu
    charm-origin: local
    charm-name: kafka-rest-confluent-k8s
    charm-rev: 0
    exposed: false
    life: dying
    application-status:
      current: waiting
      message: waiting for machine
      since: 05 Jun 2018 15:51:19+02:00
    relations:
      kubernetes:
      - deve
    endpoint-bindings:
      kafka: ""
      kubernetes: ""
      upstream: ""
application-endpoints:
  deve:
    url: vmware-main:sborny/sborny-tutorial.deve
    endpoints:
      kubernetes-deployer:
        interface: kubernetes-deployer
        role: provider
    life: dying
    application-status:
      current: error
      message: 'cannot get discharge from "https://10.10.139.74:17070/offeraccess":
        cannot acquire discharge: cannot http POST to "https://10.10.139.74:17070/offeraccess/discharge":
        Post https://10.10.139.74:17070/offeraccess/discharge: net/http: TLS handshake
        timeout'
      since: 10 Aug 2018 07:05:59+02:00
    relations:
      kubernetes-deployer:
      - kafka-rest-k8s
  leggo:
    url: vmware-main:sborny/sborny-tutorial.leggo
    endpoints:
      kubernetes-deployer:
        interface: kubernetes-deployer
        role: provider
    life: dying
    application-status:
      current: active
      message: Ready
      since: 11 Sep 2018 12:44:30+02:00
    relations:
      kubernetes-deployer:
      - kafka-rest
controller:
  timestamp: 10:54:24+02:00

I have a slew of models in this state across my 3 controllers, and I’m experiencing this in 20+ of my JAAS models. @uros-jovanovic is looking into my JAAS models; possibly he has some input here.

Did anybody find a resolution to this? I’m getting the same thing for models with cross-model relations, with exactly the same status output as @merlijn-sebrechts.

Same here with a stale cross-model relation.

Whether due to a stale cross-model relation, a unit currently in a hook error state, a cloud API error, or a number of other reasons, removing applications can become stuck when the “do everything properly” workflow is not possible. This cycle we are going to address the issue by adding:

  • remove-application --force
  • remove-unit --force
  • destroy-model --continue-on-error

Unfortunately the fix is still in progress so there’s nothing easy that can be done right now to solve the issue.
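
For illustration, the intended usage is roughly as follows. This is a sketch only; the exact flag names (in particular the model-level option) may still change before release. The application and model names are taken from earlier in this thread, and the unit name is just a placeholder:

juju remove-application kafka-rest --force              # remove the application even though its unit never got a machine
juju remove-unit kafka-rest/0 --force                   # same idea at the unit level (placeholder unit name)
juju destroy-model providenceplus --continue-on-error   # planned model-level option; final name still being settled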

Awesome!

How will this work with existing models that can’t upgrade to a newer version (because they’re destroying or in an error state)? Will we still be able to use these commands to remove those models?

Once the controller and models are upgraded, the commands will work with existing models to get them cleaned up.

Just checked and I’m still able to upgrade the broken models, so this should be fine. Thanks!

Will destroy-model --continue-on-error be available in 2.6? I’m trying 2.6 beta1 now and don’t have it available.

I have a similar situation to the above, where I did a juju destroy-model. The application was removed. Both juju status and juju list-models show 0 machines in the model, but running juju destroy-model tries forever to remove a machine and application.

@aisrael,
We are providing a ‘--force’ option on many commands, including ‘destroy-model’, ‘remove-application’, ‘remove-relation’, and ‘remove-unit’, specifically to unstick stuck removals and destructions in 2.6.

It will help your case. We are testing scenarios very similar to what you are describing: a model destruction is stuck on removing an application, and running ‘remove-application --force’ in a separate window allows the model destruction to proceed and succeed. Obviously, with ‘--force’ on ‘destroy-model’ itself, you will not need a separate command in a different window.
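
As a concrete sketch of the kind of scenario we are testing (model and application names are just the ones from earlier in this thread):

# window 1: the model destruction is stuck on the dying applications
juju destroy-model providenceplus

# window 2: force-remove the stuck applications; the destruction in window 1 then proceeds
juju remove-application kafka-rest --force
juju remove-application kafka-rest-k8s --force

# or, with --force on destroy-model itself, a single command is enough
juju destroy-model providenceplus --force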

I am not too sure where you are getting ‘continue-on-error’ from… Maybe it was discussed or planned at some stage? We have decided to settle on ‘--force’ as it is more intuitive and consistent with our existing terminology.

Hi @anastasia-macmood,

The continue-on-error was mentioned above by @wallyworld in reference to destroying a model.

The situation I’ve found is that the application is removed, but the model is still stuck destroying. It’s as if some internal state is out of sync: juju status on the model shows no applications deployed, but destroy-model reports that it’s still trying to remove a machine and an application.

At that point, there’s no visible application for me to force the removal of, which is why I asked about forcing the model destruction as well.
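
Concretely, what I’m hoping to be able to do is something like the following (the model name is just a placeholder), so a stuck model can be torn down even when nothing inside it is visible to act on:

juju destroy-model my-stuck-model --force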