Juju on vSphere, datacenter and credentials lost (+ workaround)



Long story short:

PROBLEM: Juju suddenly and silently stops communicating with the vCenter server. There are no obvious errors in the logs, but trying to upgrade the controller fails with “datacenter not found”.

WORKAROUND: Solved by updating the credentials on the controller (to the same values as before). The problem affected all models on one controller.

Not sure if this is the correct forum for this, but maybe it will help someone. I still have models with this problem left in my controller if someone wants more info.

Scenario:

Juju version 2.5.1 on the controller and other models in this scenario.
ESXi 6.5, build 11925212
vCenter 6.7.0 build 10244857
Possibly important: We also use Candid.

For full context: @erik-lonroth and I rapidly deployed close to 100 machines with Juju on a vSphere cloud by deploying about 25 copies of the slurm-core bundle (https://jujucharms.com/u/omnivector/slurm-core/). All deployments were successful, but after we walked through the 25 models and added one more unit of the slurm-node charm to each, the controller seemed to lose all means of communicating with the vCenter server.

As a result, we could no longer add units (machines), and we could not delete models or machines. As an example, the following output shows the juju status at some point for a model that I had attempted to destroy. Machines 4-5 have failed to deploy, but machines 0-3 are actually up and running just fine.

$ juju status
Model Controller Cloud/Region Version SLA Timestamp
slurm-u iuba-vmware vmware01-prod/Sodertalje-HPC 2.5.1 unsupported 13:57:02+02:00

Machine  State    DNS             Inst id        Series  AZ  Message
0        stopped  10.104.129.171  juju-3ae1b0-0  xenial      poweredOn    
1        stopped  10.104.129.44   juju-3ae1b0-1  disco       poweredOn
2        stopped  10.104.129.45   juju-3ae1b0-2  disco       poweredOn
3        stopped  10.104.129.46   juju-3ae1b0-3  disco       poweredOn     
4        pending                  pending        disco         
5        pending                  pending        disco
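
With 25 models, picking out the stuck machines by eye gets tedious. A minimal sketch, assuming the tabular `juju status` layout shown above (an empty DNS column, so the instance id is the third field): it prints the ids of machines that are still `pending` and have no instance id yet. The function name `pending_machines` is made up for this example.

```shell
# Print the ids of machines that are "pending" with no instance id yet.
# Reads tabular `juju status` output on stdin.
pending_machines() {
    awk '
        # The machine table starts at its header row.
        $1 == "Machine" && $2 == "State" { in_machines = 1; next }
        # A blank line ends the table.
        in_machines && NF == 0 { in_machines = 0 }
        # DNS is empty for these rows, so field 3 is the "Inst id" column.
        in_machines && $2 == "pending" && $3 == "pending" { print $1 }
    '
}

# Usage:
#   juju status | pending_machines
```

For the status output above this prints 4 and 5. A machine that already has an instance id but is still booting is deliberately not listed.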

I logged in to the controller and dug around in /var/log/juju, but found nothing that was easy to interpret. There were a lot of TCP connections between the Juju controller and the vCenter server in the TIME_WAIT state. In desperation I restarted the Juju services on the controller and even rebooted the controller machine itself, to no avail.
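
For anyone who wants to check the same thing, here is a rough sketch of how the TIME_WAIT connections could be counted. It assumes `netstat -tan` style output with IPv4 addresses (foreign address in column 5, state in column 6). The function name `count_time_wait` and the address 192.0.2.10 are placeholders, not values from this setup.

```shell
# Count TIME_WAIT sockets whose remote endpoint is the given host.
# Reads `netstat -tan` style lines on stdin:
#   Proto Recv-Q Send-Q Local-Address Foreign-Address State
count_time_wait() {
    remote_host="$1"
    awk -v host="$remote_host" '
        $6 == "TIME_WAIT" {
            # Strip the trailing :port from the foreign address.
            addr = $5
            sub(/:[0-9]+$/, "", addr)
            if (addr == host) n++
        }
        END { print n + 0 }
    '
}

# Usage on the controller (replace the placeholder with the vCenter address):
#   netstat -tan | count_time_wait 192.0.2.10
```

A high count on its own does not prove anything, since TIME_WAIT is a normal state for recently closed connections.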

No errors are visible to me in the vCenter web interface.

As I noticed we were not completely up to date, I tried to upgrade juju. First:

$ juju upgrade-juju
best version:
    2.5.4
ERROR model cannot be upgraded to 2.5.4 while the controller is 2.5.1: upgrade 'controller' model first

Ok, fair enough. On the controller model:

$ juju upgrade-juju                                                           
best version:     
    2.5.4   
ERROR cannot make API call to provider: datacenter 'Sodertalje-HPC' not found

This is weird, since this is indeed the name of our datacenter in vCenter. I started to suspect something fishy with the credentials after all, even though nothing had changed.

hallback@t1000:~$ juju list-credentials                      
Cloud           Credentials                                  
vmware01-prod   johanh*                                      
                                                         
hallback@t1000:~$ juju show-credential vmware01-prod johanh                                
controller-credentials:                                      
  vmware01-prod:                                             
    johanh:                                                  
      content:                                               
        auth-type: userpass                                  
        user: johanh@domain.company.com                      
      models: {}                                             
                                                         
hallback@t1000:~$ juju set-credential -m controller vmware01-prod johanh                   
Found credential remotely, on the controller. Not looking locally...                       
Changed cloud credential on model "controller" to "johanh".                                
hallback@t1000:~$ juju show-credential vmware01-prod johanh                                
controller-credentials:                                      
  vmware01-prod:                                             
    johanh:                                                  
      content:                                               
        auth-type: userpass                                  
        user: johanh@domain.company.com                      
      models:                                                
        controller: admin                                    
                                                         
hallback@t1000:~$ juju upgrade-juju                          
best version:                                                
    2.5.4                                                    
started upgrade to 2.5.4                                     

Ok! Finally something works.

hallback@t1000:~$ juju status                                
Model       Controller   Cloud/Region                  Version  SLA          Timestamp     
controller  iuba-vmware  vmware01-prod/Sodertalje-HPC  2.5.4    unsupported  14:03:20+02:00
                                                             
Machine  State    DNS             Inst id        Series  AZ  Message                       
0        started  10.104.129.212  juju-83eb7f-0  bionic      poweredOn                     

Now my idea was to go back to one of the faulty models and continue my work with the original problem:

hallback@t1000:~$ juju switch slurm-u                        
iuba-vmware:admin/controller -> iuba-vmware:JHALLBACK@domain/slurm-u                       
hallback@t1000:~$ juju status                                
Model    Controller   Cloud/Region                  Version  SLA          Timestamp        
slurm-u  iuba-vmware  vmware01-prod/Sodertalje-HPC  2.5.1    unsupported  14:04:38+02:00   
                                                             
Machine  State    DNS             Inst id        Series  AZ  Message                       
0        stopped  10.104.129.171  juju-3ae1b0-0  xenial      poweredOn                     
1        stopped  10.104.129.44   juju-3ae1b0-1  disco       poweredOn                     
2        stopped  10.104.129.45   juju-3ae1b0-2  disco       poweredOn                     
3        stopped  10.104.129.46   juju-3ae1b0-3  disco       poweredOn                     
4        pending                  pending        disco                                     
5        pending                  pending        disco                                     

Still no sign of life. Ok, so let’s upgrade this one too:

hallback@t1000:~$ juju upgrade-juju
best version:
    2.5.4
ERROR some agents have not upgraded to the current model version 2.5.1: machine-4, machine-5

hallback@t1000:~$ juju remove-machine 4 --force
removing machine 4
hallback@t1000:~$ juju remove-machine 5 --force
removing machine 5

(The machines would not disappear according to the output of juju status)

hallback@t1000:~$ juju add-credential --replace vmware01-prod
Enter credential name: johanh

A credential "johanh" already exists locally on this client.
Replace local credential? (y/N): y

Using auth-type "userpass".

Enter user: johanh@domain.company.com

Enter password: 

Credential "johanh" updated locally for cloud "vmware01-prod".

hallback@t1000:~$ juju update-credential vmware01-prod johanh                 
Credential valid for:
  slurm-f
  slurm-b
  slurm-p
  slurm-k
  slurm-e
  slurm-c
  slurm-t
  slurm-g
  slurm-l
  slurm-a
  slurm-h
  slurm-i
  slurm-q
  slurm-m
  slurm-u
  slurm-n
  slurm-d
  slurm-s
  slurm-o
  slurm-r
  slurm-j
Controller credential "johanh" for user "JHALLBACK@domain" on cloud "vmware01-prod" updated.
For more information, see ‘juju show-credential vmware01-prod johanh’.

After this step, everything starts to work!

hallback@t1000:~$ juju status
ERROR model iuba-vmware:JHALLBACK@domain/slurm-u not found

The model is now deleted! This should have happened hours ago. The machines (VMs) were also immediately deleted in VMware.

Now, retrying this method on some other model that I haven’t issued a “destroy” on yet:

hallback@t1000:~$ juju switch slurm-m
iuba-vmware:admin/controller -> iuba-vmware:JHALLBACK@domain/slurm-m

hallback@t1000:~$ juju add-unit slurm-node
hallback@t1000:~$ juju status
Model    Controller   Cloud/Region                  Version  SLA          Timestamp
slurm-m  iuba-vmware  vmware01-prod/Sodertalje-HPC  2.5.1    unsupported  11:58:58+02:00

App               Version    Status   Scale  Charm             Store       Rev  OS      Notes
mysql             5.7.25     active       1  mysql             jujucharms   58  ubuntu  
slurm-controller  18.08.5.2  active       1  slurm-controller  jujucharms    4  ubuntu  
slurm-dbd         18.08.5.2  active       1  slurm-dbd         jujucharms    1  ubuntu  
slurm-node                   waiting    1/2  slurm-node        jujucharms    6  ubuntu  

Unit                 Workload  Agent       Machine  Public address  Ports     Message
mysql/0*             active    idle        0        10.104.129.243  3306/tcp  Ready
slurm-controller/0*  active    idle        2        10.104.129.206            Ready
slurm-dbd/0*         active    idle        1        10.104.129.169            Ready
slurm-node/0*        active    idle        3        10.104.129.133            Ready
slurm-node/1         waiting   allocating  4                                  waiting for machine

Machine  State    DNS             Inst id        Series  AZ  Message
0        started  10.104.129.243  juju-9048b9-0  xenial      poweredOn
1        started  10.104.129.169  juju-9048b9-1  disco       poweredOn
2        started  10.104.129.206  juju-9048b9-2  disco       poweredOn
3        started  10.104.129.133  juju-9048b9-3  disco       poweredOn
4        pending                  pending        disco

…and we wait more than ten minutes, nothing happens…

hallback@t1000:~$ juju status
Model    Controller   Cloud/Region                  Version  SLA          Timestamp
slurm-m  iuba-vmware  vmware01-prod/Sodertalje-HPC  2.5.1    unsupported  12:09:37+02:00

App               Version    Status   Scale  Charm             Store       Rev  OS      Notes
mysql             5.7.25     active       1  mysql             jujucharms   58  ubuntu  
slurm-controller  18.08.5.2  active       1  slurm-controller  jujucharms    4  ubuntu  
slurm-dbd         18.08.5.2  active       1  slurm-dbd         jujucharms    1  ubuntu  
slurm-node                   waiting    1/2  slurm-node        jujucharms    6  ubuntu  

Unit                 Workload  Agent       Machine  Public address  Ports     Message
mysql/0*             active    idle        0        10.104.129.243  3306/tcp  Ready
slurm-controller/0*  active    idle        2        10.104.129.206            Ready
slurm-dbd/0*         active    idle        1        10.104.129.169            Ready
slurm-node/0*        active    idle        3        10.104.129.133            Ready
slurm-node/1         waiting   allocating  4                                  waiting for machine

Machine  State    DNS             Inst id        Series  AZ  Message
0        started  10.104.129.243  juju-9048b9-0  xenial      poweredOn
1        started  10.104.129.169  juju-9048b9-1  disco       poweredOn
2        started  10.104.129.206  juju-9048b9-2  disco       poweredOn
3        started  10.104.129.133  juju-9048b9-3  disco       poweredOn
4        pending                  pending        disco

hallback@t1000:~$ juju whoami
Controller:  iuba-vmware
Model:       slurm-m
User:        JHALLBACK@domain

hallback@t1000:~$ juju list-credentials vmware01-prod 
Cloud          Credentials
vmware01-prod  johanh*

hallback@t1000:~$ juju show-credential vmware01-prod johanh
controller-credentials:
  vmware01-prod:
    johanh:
      content:
        auth-type: userpass
        user: johanh@domain.company.com
      models:
        slurm-a: admin
        slurm-b: admin
        slurm-c: admin
        slurm-d: admin
        slurm-e: admin
        slurm-f: admin
        slurm-g: admin
        slurm-h: admin
        slurm-i: admin
        slurm-j: admin
        slurm-k: admin
        slurm-l: admin
        slurm-m: admin
        slurm-n: admin
        slurm-o: admin
        slurm-p: admin
        slurm-t: admin

Ok, everything SHOULD be fine. Now, at 12:13:33+02:00, I issue the following command:

hallback@t1000:~$ juju update-credential vmware01-prod johanh
Credential valid for:
  slurm-f
  slurm-b
  slurm-p
  slurm-k
  slurm-e
  slurm-c
  slurm-t
  slurm-g
  slurm-l
  slurm-a
  slurm-h
  slurm-i
  slurm-m
  slurm-n
  slurm-d
  slurm-o
  slurm-j
Controller credential "johanh" for user "JHALLBACK@domain" on cloud "vmware01-prod" updated.
For more information, see ‘juju show-credential vmware01-prod johanh’.

The machine starts to deploy within 10 seconds, and its status is set to poweredOn:

hallback@t1000:~$ juju status
Model    Controller   Cloud/Region                  Version  SLA          Timestamp
slurm-m  iuba-vmware  vmware01-prod/Sodertalje-HPC  2.5.1    unsupported  12:14:59+02:00

App               Version    Status   Scale  Charm             Store       Rev  OS      Notes
mysql             5.7.25     active       1  mysql             jujucharms   58  ubuntu  
slurm-controller  18.08.5.2  active       1  slurm-controller  jujucharms    4  ubuntu  
slurm-dbd         18.08.5.2  active       1  slurm-dbd         jujucharms    1  ubuntu  
slurm-node                   waiting    1/2  slurm-node        jujucharms    6  ubuntu  

Unit                 Workload  Agent       Machine  Public address  Ports     Message
mysql/0*             active    idle        0        10.104.129.243  3306/tcp  Ready
slurm-controller/0*  active    idle        2        10.104.129.206            Ready
slurm-dbd/0*         active    idle        1        10.104.129.169            Ready
slurm-node/0*        active    idle        3        10.104.129.133            Ready
slurm-node/1         waiting   allocating  4                                  waiting for machine

Machine  State    DNS             Inst id        Series  AZ  Message
0        started  10.104.129.243  juju-9048b9-0  xenial      poweredOn
1        started  10.104.129.169  juju-9048b9-1  disco       poweredOn
2        started  10.104.129.206  juju-9048b9-2  disco       poweredOn
3        started  10.104.129.133  juju-9048b9-3  disco       poweredOn
4        pending                  juju-9048b9-4  disco       poweredOn

A few minutes later the unit is up and related:

$ juju status
Model    Controller   Cloud/Region                  Version  SLA          Timestamp
slurm-m  iuba-vmware  vmware01-prod/Sodertalje-HPC  2.5.1    unsupported  12:17:41+02:00

App               Version    Status  Scale  Charm             Store       Rev  OS      Notes
mysql             5.7.25     active      1  mysql             jujucharms   58  ubuntu  
slurm-controller  18.08.5.2  active      1  slurm-controller  jujucharms    4  ubuntu  
slurm-dbd         18.08.5.2  active      1  slurm-dbd         jujucharms    1  ubuntu  
slurm-node        18.08.6.2  active      2  slurm-node        jujucharms    6  ubuntu  

Unit                 Workload  Agent  Machine  Public address  Ports     Message
mysql/0*             active    idle   0        10.104.129.243  3306/tcp  Ready
slurm-controller/0*  active    idle   2        10.104.129.206            Ready
slurm-dbd/0*         active    idle   1        10.104.129.169            Ready
slurm-node/0*        active    idle   3        10.104.129.133            Ready
slurm-node/1         active    idle   4        10.104.129.23             Ready

Machine  State    DNS             Inst id        Series  AZ  Message
0        started  10.104.129.243  juju-9048b9-0  xenial      poweredOn
1        started  10.104.129.169  juju-9048b9-1  disco       poweredOn
2        started  10.104.129.206  juju-9048b9-2  disco       poweredOn
3        started  10.104.129.133  juju-9048b9-3  disco       poweredOn
4        started  10.104.129.23   juju-9048b9-4  disco       poweredOn
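
To recap, the workaround can be wrapped in a small helper. This is only a sketch of what worked here, not a confirmed fix. It uses the same `juju update-credential` and `juju show-credential` commands as above, and the function name `refresh_controller_credential` is made up for this example. In the first run above the local credential was also replaced with `juju add-credential --replace` beforehand; in the second run `update-credential` alone was enough, so the sketch leaves that interactive step out.

```shell
# Re-push an existing credential to the controller, then show which models use it.
# Arguments: cloud name and credential name (vmware01-prod and johanh in this post).
refresh_controller_credential() {
    cloud="$1"
    credential="$2"
    if [ -z "$cloud" ] || [ -z "$credential" ]; then
        echo "usage: refresh_controller_credential CLOUD CREDENTIAL" >&2
        return 2
    fi
    # Same command as used above; the credential content itself is unchanged.
    juju update-credential "$cloud" "$credential" || return 1
    # Confirm which models the controller now associates with the credential.
    juju show-credential "$cloud" "$credential"
}

# Usage:
#   refresh_controller_credential vmware01-prod johanh
```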