Can't remove model, last application "waiting for machine"


#1

Hi all

I have a model that’s been trying to destroy itself for a few days now. It hangs on “model not empty” because one application is in the state “waiting for machine”.

Is there a way to force-destroy this model?
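For reference, this is roughly what I've been running (a plain destroy, no special flags; the model name is taken from the status below):

juju destroy-model cot-dev2

And the status of the stuck model, from juju status --format yaml: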

model:
  name: cot-dev2
  type: iaas
  controller: vmware-main
  cloud: vmware1
  region: ILABT
  version: 2.3.3
  model-status:
    current: destroying
    message: 'attempt 1 to destroy model failed (will retry):  model not empty, found
      1 application (model not empty)'
    since: 19 Sep 2018 14:18:20+02:00
  meter-status:
    color: amber
    message: user verification pending
  sla: unsupported
machines: {}
applications:
  nginx-api-gateway:
    charm: local:xenial/nginx-api-gateway-0
    series: xenial
    os: ubuntu
    charm-origin: local
    charm-name: nginx-api-gateway
    charm-rev: 0
    exposed: false
    life: dying
    application-status:
      current: waiting
      message: waiting for machine
      since: 18 Feb 2018 21:56:24+01:00
    endpoint-bindings:
      upstream: ""
      website: ""
controller:
  timestamp: 10:50:08+02:00

#2

We have a bunch of models that are in this state. This might actually be the reason for the 40 Mb/s of traffic.


#3

It could well be.

For quite some time we have thought about having a destroy-model --force for those situations where things get stuck when they shouldn’t.

One of the underlying bugs here, I think, is that application destruction shouldn’t be blocked by the “waiting for machine” state in this situation (and probably a few others).


#4

This seems a lot like lp:1654928, which is marked as fixed, so maybe this is a different issue.

I believe we’ve talked about having a “remove-application --force” of some sort. That said, I think we can just fix this one. If an application has a unit that has no machine available for it, we should just allow that unit to be removed and the application killed.


#5

The reason we want to remove some of these models is exactly this “stuck” application; we hoped removing the model would fix it. So remove-application --force would actually help us even more, since it would mean we don’t have to remove the model at all.

Is there a workaround I can use right now to remove these applications and models while waiting for the fix? I’d like to see whether that reduces the network traffic.


#6

I think it resembles this bug report more: https://bugs.launchpad.net/juju/+bug/1724673

All of the “stuck” applications either provided or consumed cross-model relations. Below is the YAML status of another model that fails to destroy; you can still see some remnants of the cross-model relations.

juju status --format yaml
model:
  name: providenceplus
  type: iaas
  controller: vmware-main
  cloud: vmware1
  region: ILABT
  version: 2.3.8
  model-status:
    current: destroying
    message: 'attempt 39 to destroy model failed (will retry):  model not empty, found
      2 applications (model not empty)'
    since: 20 Sep 2018 10:31:00+02:00
  meter-status:
    color: amber
    message: user verification pending
  sla: unsupported
machines: {}
applications:
  kafka-rest:
    charm: local:xenial/kafka-rest-confluent-k8s-3
    series: xenial
    os: ubuntu
    charm-origin: local
    charm-name: kafka-rest-confluent-k8s
    charm-rev: 3
    exposed: false
    life: dying
    application-status:
      current: waiting
      message: waiting for machine
      since: 20 Jun 2018 09:34:30+02:00
    relations:
      kubernetes:
      - leggo
    endpoint-bindings:
      kafka: ""
      kubernetes: ""
      upstream: ""
  kafka-rest-k8s:
    charm: local:xenial/kafka-rest-confluent-k8s-0
    series: xenial
    os: ubuntu
    charm-origin: local
    charm-name: kafka-rest-confluent-k8s
    charm-rev: 0
    exposed: false
    life: dying
    application-status:
      current: waiting
      message: waiting for machine
      since: 05 Jun 2018 15:51:19+02:00
    relations:
      kubernetes:
      - deve
    endpoint-bindings:
      kafka: ""
      kubernetes: ""
      upstream: ""
application-endpoints:
  deve:
    url: vmware-main:sborny/sborny-tutorial.deve
    endpoints:
      kubernetes-deployer:
        interface: kubernetes-deployer
        role: provider
    life: dying
    application-status:
      current: error
      message: 'cannot get discharge from "https://10.10.139.74:17070/offeraccess":
        cannot acquire discharge: cannot http POST to "https://10.10.139.74:17070/offeraccess/discharge":
        Post https://10.10.139.74:17070/offeraccess/discharge: net/http: TLS handshake
        timeout'
      since: 10 Aug 2018 07:05:59+02:00
    relations:
      kubernetes-deployer:
      - kafka-rest-k8s
  leggo:
    url: vmware-main:sborny/sborny-tutorial.leggo
    endpoints:
      kubernetes-deployer:
        interface: kubernetes-deployer
        role: provider
    life: dying
    application-status:
      current: active
      message: Ready
      since: 11 Sep 2018 12:44:30+02:00
    relations:
      kubernetes-deployer:
      - kafka-rest
controller:
  timestamp: 10:54:24+02:00

#7

I have a slew of models in this state across my 3 controllers, and I’m experiencing this in 20+ of my JAAS models. @uros-jovanovic is looking into my JAAS models; perhaps he has some input here.


#8

Did anybody find a resolution to this? I’m getting the same thing for models with cross-model relations: exactly the same status output as @merlijn-sebrechts.


#9

Same here, also with a stale cross-model relation.


#10

Whether due to a stale cross-model relation, a unit currently in a hook error state, a cloud API error, or a number of other reasons, removing applications can become stuck whenever the “do everything properly” workflow is not possible. This cycle we are going to address the issue by adding:

  • remove-application --force
  • remove-unit --force
  • destroy-model --continue-on-error

Unfortunately, the fix is still in progress, so there’s nothing easy that can be done right now to solve the issue.
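To make the intent concrete, usage should look roughly like this once the work lands (the exact syntax may change before release, and the application, unit, and model names below are just illustrative examples taken from the statuses above):

juju remove-application --force kafka-rest
juju remove-unit --force kafka-rest/0
juju destroy-model --continue-on-error providenceplus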


#12

Awesome!

How will this work with existing models that can’t upgrade to a newer version (because they’re destroying or in an error state)? Will we still be able to use these commands to remove those models?


#13

Once the controller and models are upgraded, the commands will work with existing models to get them cleaned up.
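For a 2.x-era client, the upgrade itself would be something along these lines, upgrading the controller model first and then each stuck model (upgrade-juju is the 2.x upgrade command; the model name is an example from the statuses above):

juju upgrade-juju -m controller
juju upgrade-juju -m providenceplus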


#14

Just checked and I’m still able to upgrade the broken models, so this should be fine. Thanks!