Juju on vSphere, datacenter and credentials lost (+ workaround)


#1

Long story short:

PROBLEM: Juju suddenly and silently fails to communicate with the vCenter server. There are no errors in the logs, and it reports “datacenter not found” when trying to upgrade the controller.

WORKAROUND: Solved by re-applying the same credentials on the controller. The problem affected all models on one controller.

Not sure if this is the correct forum for this, but maybe it will help someone. I still have models with this problem in my controller if someone wants more info.
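For quick reference, the workaround boils down to re-entering the (unchanged) credential locally and pushing it to the controller. Here is a minimal sketch as a shell function; the cloud and credential names ("vmware01-prod", "johanh") are the ones from this report, so substitute your own. Note that add-credential prompts interactively for the user and password:

```shell
#!/usr/bin/env bash
# Sketch of the workaround; cloud/credential names are from this report.
refresh_vsphere_credential() {
    local cloud="$1" cred="$2"
    # Re-enter the same credential locally (interactive prompts)...
    juju add-credential --replace "$cloud"
    # ...then push it to the controller, which revives communication.
    juju update-credential "$cloud" "$cred"
}
# Usage: refresh_vsphere_credential vmware01-prod johanh
```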

Scenario:

Juju version 2.5.1 on the controller and other models in this scenario.
ESXi 6.5, build 11925212
vCenter 6.7.0 build 10244857
Possibly important: We also use Candid.

For full context: @erik-lonroth and I rapidly deployed close to 100 machines with Juju on a vSphere cloud, by deploying about 25 copies of the slurm-core bundle (https://jujucharms.com/u/omnivector/slurm-core/). All deployments were successful, but after walking through the 25 models and adding yet another unit of the slurm-node charm to each, the controller seems to have lost all means of communication with the vCenter server.

As a result, we could no longer add units (machines), and we could not delete models or machines. As an example, the following output shows the juju status at some point for a model that I have attempted to destroy. Machines 4-5 have failed to deploy, but machines 0-3 are actually up and running just fine.

$ juju status
Model Controller Cloud/Region Version SLA Timestamp
slurm-u iuba-vmware vmware01-prod/Sodertalje-HPC 2.5.1 unsupported 13:57:02+02:00

Machine  State    DNS             Inst id        Series  AZ  Message
0        stopped  10.104.129.171  juju-3ae1b0-0  xenial      poweredOn    
1        stopped  10.104.129.44   juju-3ae1b0-1  disco       poweredOn
2        stopped  10.104.129.45   juju-3ae1b0-2  disco       poweredOn
3        stopped  10.104.129.46   juju-3ae1b0-3  disco       poweredOn     
4        pending                  pending        disco         
5        pending                  pending        disco

I tried to access the controller and dig around in /var/log/juju, but this gave me nothing that was easy to interpret. There were a lot of TCP connections between the Juju controller and the vCenter server in the TIME_WAIT state. I desperately tried restarting the Juju services on the controller and even rebooted the controller machine itself, to no avail.

No errors are visible to me in the vCenter web interface.

As I noticed we were not completely up to date, I tried to upgrade juju. First:

$ juju upgrade-juju
best version:
    2.5.4                                                                                                                                                                                                                                                                      
ERROR model cannot be upgraded to 2.5.4 while the controller is 2.5.1: upgrade 'controller' model first                                                                                                                                        

Ok, fair enough. On the controller model:

$ juju upgrade-juju                                                           
best version:     
    2.5.4   
ERROR cannot make API call to provider: datacenter 'Sodertalje-HPC' not found

This is weird, since this is indeed the name of our datacenter in vCenter. I started to suspect something fishy with the credentials after all, even though nothing had changed.

hallback@t1000:~$ juju list-credentials                      
Cloud           Credentials                                  
vmware01-prod   johanh*                                      
                                                         
hallback@t1000:~$ juju show-credential vmware01-prod johanh                                
controller-credentials:                                      
  vmware01-prod:                                             
    johanh:                                                  
      content:                                               
        auth-type: userpass                                  
        user: johanh@domain.company.com                      
      models: {}                                             
                                                         
hallback@t1000:~$ juju set-credential -m controller vmware01-prod johanh                   
Found credential remotely, on the controller. Not looking locally...                       
Changed cloud credential on model "controller" to "johanh".                                
hallback@t1000:~$ juju show-credential vmware01-prod johanh                                
controller-credentials:                                      
  vmware01-prod:                                             
    johanh:                                                  
      content:                                               
        auth-type: userpass                                  
        user: johanh@domain.company.com                      
      models:                                                
        controller: admin                                    
                                                         
hallback@t1000:~$ juju upgrade-juju                          
best version:                                                
    2.5.4                                                    
started upgrade to 2.5.4                                     

Ok! Finally something works.

hallback@t1000:~$ juju status                                
Model       Controller   Cloud/Region                  Version  SLA          Timestamp     
controller  iuba-vmware  vmware01-prod/Sodertalje-HPC  2.5.4    unsupported  14:03:20+02:00
                                                             
Machine  State    DNS             Inst id        Series  AZ  Message                       
0        started  10.104.129.212  juju-83eb7f-0  bionic      poweredOn                     

Now my idea was to go back to one of the faulty models and continue my work with the original problem:

hallback@t1000:~$ juju switch slurm-u                        
iuba-vmware:admin/controller -> iuba-vmware:JHALLBACK@domain/slurm-u                       
hallback@t1000:~$ juju status                                
Model    Controller   Cloud/Region                  Version  SLA          Timestamp        
slurm-u  iuba-vmware  vmware01-prod/Sodertalje-HPC  2.5.1    unsupported  14:04:38+02:00   
                                                             
Machine  State    DNS             Inst id        Series  AZ  Message                       
0        stopped  10.104.129.171  juju-3ae1b0-0  xenial      poweredOn                     
1        stopped  10.104.129.44   juju-3ae1b0-1  disco       poweredOn                     
2        stopped  10.104.129.45   juju-3ae1b0-2  disco       poweredOn                     
3        stopped  10.104.129.46   juju-3ae1b0-3  disco       poweredOn                     
4        pending                  pending        disco                                     
5        pending                  pending        disco                                     

Still no sign of life. Ok, so let’s upgrade this one too:

hallback@t1000:~$ juju upgrade-juju
best version:
    2.5.4
ERROR some agents have not upgraded to the current model version 2.5.1: machine-4, machine-5

hallback@t1000:~$ juju remove-machine 4 --force
removing machine 4
hallback@t1000:~$ juju remove-machine 5 --force
removing machine 5

(The machines would not disappear according to the output of juju status)

hallback@t1000:~$ juju add-credential --replace vmware01-prod
Enter credential name: johanh

A credential "johanh" already exists locally on this client.
Replace local credential? (y/N): y

Using auth-type "userpass".

Enter user: johanh@domain.company.com

Enter password: 

Credential "johanh" updated locally for cloud "vmware01-prod".

hallback@t1000:~$ juju update-credential vmware01-prod johanh                 
Credential valid for:
  slurm-f
  slurm-b
  slurm-p
  slurm-k
  slurm-e
  slurm-c
  slurm-t
  slurm-g
  slurm-l
  slurm-a
  slurm-h
  slurm-i
  slurm-q
  slurm-m
  slurm-u
  slurm-n
  slurm-d
  slurm-s
  slurm-o
  slurm-r
  slurm-j
Controller credential "johanh" for user "JHALLBACK@domain" on cloud "vmware01-prod" updated.
For more information, see ‘juju show-credential vmware01-prod johanh’.

After this step, everything starts to work!

hallback@t1000:~$ juju status
ERROR model iuba-vmware:JHALLBACK@domain/slurm-u not found

The model is now deleted! This should have happened hours ago. The machines (VMs) were also immediately deleted in VMware.

Now, retrying this method on some other model that I haven’t issued a “destroy” on yet:

hallback@t1000:~$ juju switch slurm-m
iuba-vmware:admin/controller -> iuba-vmware:JHALLBACK@domain/slurm-m

hallback@t1000:~$ juju add-unit slurm-node
hallback@t1000:~$ juju status
Model    Controller   Cloud/Region                  Version  SLA          Timestamp
slurm-m  iuba-vmware  vmware01-prod/Sodertalje-HPC  2.5.1    unsupported  11:58:58+02:00

App               Version    Status   Scale  Charm             Store       Rev  OS      Notes
mysql             5.7.25     active       1  mysql             jujucharms   58  ubuntu  
slurm-controller  18.08.5.2  active       1  slurm-controller  jujucharms    4  ubuntu  
slurm-dbd         18.08.5.2  active       1  slurm-dbd         jujucharms    1  ubuntu  
slurm-node                   waiting    1/2  slurm-node        jujucharms    6  ubuntu  

Unit                 Workload  Agent       Machine  Public address  Ports     Message
mysql/0*             active    idle        0        10.104.129.243  3306/tcp  Ready
slurm-controller/0*  active    idle        2        10.104.129.206            Ready
slurm-dbd/0*         active    idle        1        10.104.129.169            Ready
slurm-node/0*        active    idle        3        10.104.129.133            Ready
slurm-node/1         waiting   allocating  4                                  waiting for machine

Machine  State    DNS             Inst id        Series  AZ  Message
0        started  10.104.129.243  juju-9048b9-0  xenial      poweredOn
1        started  10.104.129.169  juju-9048b9-1  disco       poweredOn
2        started  10.104.129.206  juju-9048b9-2  disco       poweredOn
3        started  10.104.129.133  juju-9048b9-3  disco       poweredOn
4        pending                  pending        disco

…and we wait more than ten minutes, nothing happens…

hallback@t1000:~$ juju status
Model    Controller   Cloud/Region                  Version  SLA          Timestamp
slurm-m  iuba-vmware  vmware01-prod/Sodertalje-HPC  2.5.1    unsupported  12:09:37+02:00

App               Version    Status   Scale  Charm             Store       Rev  OS      Notes
mysql             5.7.25     active       1  mysql             jujucharms   58  ubuntu  
slurm-controller  18.08.5.2  active       1  slurm-controller  jujucharms    4  ubuntu  
slurm-dbd         18.08.5.2  active       1  slurm-dbd         jujucharms    1  ubuntu  
slurm-node                   waiting    1/2  slurm-node        jujucharms    6  ubuntu  

Unit                 Workload  Agent       Machine  Public address  Ports     Message
mysql/0*             active    idle        0        10.104.129.243  3306/tcp  Ready
slurm-controller/0*  active    idle        2        10.104.129.206            Ready
slurm-dbd/0*         active    idle        1        10.104.129.169            Ready
slurm-node/0*        active    idle        3        10.104.129.133            Ready
slurm-node/1         waiting   allocating  4                                  waiting for machine

Machine  State    DNS             Inst id        Series  AZ  Message
0        started  10.104.129.243  juju-9048b9-0  xenial      poweredOn
1        started  10.104.129.169  juju-9048b9-1  disco       poweredOn
2        started  10.104.129.206  juju-9048b9-2  disco       poweredOn
3        started  10.104.129.133  juju-9048b9-3  disco       poweredOn
4        pending                  pending        disco

hallback@t1000:~$ juju whoami
Controller:  iuba-vmware
Model:       slurm-m
User:        JHALLBACK@domain

hallback@t1000:~$ juju list-credentials vmware01-prod 
Cloud          Credentials
vmware01-prod  johanh*

hallback@t1000:~$ juju show-credential vmware01-prod johanh
controller-credentials:
  vmware01-prod:
    johanh:
      content:
        auth-type: userpass
        user: johanh@domain.company.com
      models:
        slurm-a: admin
        slurm-b: admin
        slurm-c: admin
        slurm-d: admin
        slurm-e: admin
        slurm-f: admin
        slurm-g: admin
        slurm-h: admin
        slurm-i: admin
        slurm-j: admin
        slurm-k: admin
        slurm-l: admin
        slurm-m: admin
        slurm-n: admin
        slurm-o: admin
        slurm-p: admin
        slurm-t: admin

Ok, everything SHOULD be fine. Now, at 12:13:33+02:00, I issue the following command:

hallback@t1000:~$ juju update-credential vmware01-prod johanh
Credential valid for:
  slurm-f
  slurm-b
  slurm-p
  slurm-k
  slurm-e
  slurm-c
  slurm-t
  slurm-g
  slurm-l
  slurm-a
  slurm-h
  slurm-i
  slurm-m
  slurm-n
  slurm-d
  slurm-o
  slurm-j
Controller credential "johanh" for user "JHALLBACK@domain" on cloud "vmware01-prod" updated.
For more information, see ‘juju show-credential vmware01-prod johanh’.

The machine starts to deploy within 10 seconds, and the status is set to poweredOn:

hallback@t1000:~$ juju status
Model    Controller   Cloud/Region                  Version  SLA          Timestamp
slurm-m  iuba-vmware  vmware01-prod/Sodertalje-HPC  2.5.1    unsupported  12:14:59+02:00

App               Version    Status   Scale  Charm             Store       Rev  OS      Notes
mysql             5.7.25     active       1  mysql             jujucharms   58  ubuntu  
slurm-controller  18.08.5.2  active       1  slurm-controller  jujucharms    4  ubuntu  
slurm-dbd         18.08.5.2  active       1  slurm-dbd         jujucharms    1  ubuntu  
slurm-node                   waiting    1/2  slurm-node        jujucharms    6  ubuntu  

Unit                 Workload  Agent       Machine  Public address  Ports     Message
mysql/0*             active    idle        0        10.104.129.243  3306/tcp  Ready
slurm-controller/0*  active    idle        2        10.104.129.206            Ready
slurm-dbd/0*         active    idle        1        10.104.129.169            Ready
slurm-node/0*        active    idle        3        10.104.129.133            Ready
slurm-node/1         waiting   allocating  4                                  waiting for machine

Machine  State    DNS             Inst id        Series  AZ  Message
0        started  10.104.129.243  juju-9048b9-0  xenial      poweredOn
1        started  10.104.129.169  juju-9048b9-1  disco       poweredOn
2        started  10.104.129.206  juju-9048b9-2  disco       poweredOn
3        started  10.104.129.133  juju-9048b9-3  disco       poweredOn
4        pending                  juju-9048b9-4  disco       poweredOn

After a few more minutes, the unit is up and related:

$ juju status
Model    Controller   Cloud/Region                  Version  SLA          Timestamp
slurm-m  iuba-vmware  vmware01-prod/Sodertalje-HPC  2.5.1    unsupported  12:17:41+02:00

App               Version    Status  Scale  Charm             Store       Rev  OS      Notes
mysql             5.7.25     active      1  mysql             jujucharms   58  ubuntu  
slurm-controller  18.08.5.2  active      1  slurm-controller  jujucharms    4  ubuntu  
slurm-dbd         18.08.5.2  active      1  slurm-dbd         jujucharms    1  ubuntu  
slurm-node        18.08.6.2  active      2  slurm-node        jujucharms    6  ubuntu  

Unit                 Workload  Agent  Machine  Public address  Ports     Message
mysql/0*             active    idle   0        10.104.129.243  3306/tcp  Ready
slurm-controller/0*  active    idle   2        10.104.129.206            Ready
slurm-dbd/0*         active    idle   1        10.104.129.169            Ready
slurm-node/0*        active    idle   3        10.104.129.133            Ready
slurm-node/1         active    idle   4        10.104.129.23             Ready

Machine  State    DNS             Inst id        Series  AZ  Message
0        started  10.104.129.243  juju-9048b9-0  xenial      poweredOn
1        started  10.104.129.169  juju-9048b9-1  disco       poweredOn
2        started  10.104.129.206  juju-9048b9-2  disco       poweredOn
3        started  10.104.129.133  juju-9048b9-3  disco       poweredOn
4        started  10.104.129.23   juju-9048b9-4  disco       poweredOn

#2

Thanks for the detailed description of the workaround and of your debugging.
I have had exactly the same problem installing Kubernetes on vSphere for the last two weeks, and the symptoms you describe are exactly what I see. I can't find any errors anywhere in any logs, yet new machines hang in “waiting for machine” with no further explanation.

It seems like provisioning new models with Juju or conjure-up works fine, and even adding/removing units right afterwards also works fine, but if you wait a day or so it will not work. Very strange; is the user session timing out?

With the description in your post, I updated the credentials of my model and managed to complete the provisioning of a new unit that had been waiting since yesterday. It started and completed immediately after updating the credentials.


#3

@hallback sorry that I missed this post when you first wrote it. Thank you for providing such a thorough write-up!


#4

Thanks @ellensen and @timClicks for the feedback!

We’ve been running 2.5.7 for a short while, and yesterday my colleagues @xinyuem and @erik-lonroth and I figured out that we had all had our credentials “lost” again. Simply issuing "juju update-credential" fixed it for us again. I found myself repeating this several times yesterday.

We have now updated to 2.6.1. I’ve seen there have been some vSphere improvements, which is great, but probably not related to credentials.

I have also noticed odd behaviour when machines exist in vSphere that the controller does not know about, or when the controller believes a machine exists that has been removed in vSphere (which may cause a controller crash). I only mention this here because I think it happened because of the credential issue. Should anyone want more info related to this, I’d be happy to provide it.
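Since we found ourselves re-running the same command over and over, one blunt mitigation until the root cause is fixed is to re-apply the credential on a timer so the session never goes stale. A sketch, assuming the cloud and credential names from this thread ("vmware01-prod", "johanh"), with a bounded iteration count so it can also be run as a single shot from cron:

```shell
#!/usr/bin/env bash
# Crude watchdog sketch: re-apply the controller credential on a timer.
# Cloud and credential names are the ones from this thread; substitute your own.
refresh_credential_loop() {
    local cloud="$1" cred="$2" count="$3" interval="$4"
    local i
    for ((i = 0; i < count; i++)); do
        # update-credential pushes the stored credential to the controller,
        # which restarts the provider workers that were shut down.
        juju update-credential "$cloud" "$cred"
        sleep "$interval"
    done
}
# e.g. a single refresh, suitable for an hourly cron job:
# refresh_credential_loop vmware01-prod johanh 1 0
```

An hourly refresh stays well inside the 120-minute vCenter session timeout discussed later in this thread.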


#5

@hallback
Since my last post above, I have had the same problem again: the credentials were lost the day after.
Running “juju update-credential” immediately started provisioning the machines successfully again.
I suspect that if I tried to add/remove units today, I would have the same problem again.


#6

@ellensen, what you just wrote made me think of one thing, which I maybe should have done a long time ago.

When I use the vSphere Web Client by navigating to https://vcenter.fqdn/ui/ or https://vcenter.fqdn/vsphere-client/, I find myself automatically logged out after some time of inactivity, probably 120 minutes, which is the default timeout. This happens regardless of which of the two front ends I use. Note that https://vcenter.fqdn is the same service endpoint that the Juju controller talks to.

It is possible to change or remove this web client timeout on the vCenter server: https://docs.vmware.com/en/VMware-vSphere/6.7/com.vmware.vsphere.vcenterhost.doc/GUID-975412DE-CDCB-49A1-8E2A-0965325D33A5.html

I have not taken a look at the source code here, but if you are in control of your vCenter server (which I am not), try setting this timeout to 0 and restarting the service. If the problem goes away, there’s the solution.
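For reference, the setting behind the linked document is a single property in webclient.properties. The file path and affected client below are assumptions based on a vCenter Server Appliance with the HTML5 client, so verify them against your own installation:

```
# /etc/vmware/vsphere-ui/webclient.properties (HTML5 client on the VCSA; assumed path)
# 0 disables the idle timeout; the default is 120 minutes
session.timeout = 0
```

The corresponding client service presumably needs a restart afterwards for the change to take effect.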


#7

@rick_h If it is like @hallback describes, this is likely something that needs to be clearly communicated to anyone using vSphere-backed Juju clouds, as it's otherwise a source of a lot of problems.


#8

I’ve added a note to the vSphere page.

Some of the grief can be mitigated by having LP 1822117 fixed.

I do believe this should be reported as a Juju bug to see if anything can be done on that side of things.


#9

@hallback I tried setting the timeout to 0 and then tried to remove some units I had created a couple of days ago. They waited in status stopped, but still had the message poweredOn, until I ran the update-credential command again and both units were removed immediately. Setting the timeout to 0 didn't seem to work.


#10

Thank you @ellensen for trying that, good to know. I’ll follow the advice from @pmatulis and file a bug report in the next few days.


#11

We have the same issue. Can you link the bug report here when it’s filed?


#12

Sorry folks, just getting back after some travel and catching up.

So this sounds like the code that was introduced to shut down workers when the credentials come up as invalid (to avoid spamming the controller) is having an adverse effect on this setup. For some reason, contact with vSphere ends up looking like invalid credentials, and Juju stops trying even though the credentials are in fact still good. I wonder if there’s some form of rate limiting where we’re treating that limit-reached response as invalid credentials?

Definitely please file a bug; the closer we can get to repro steps, the better. I guess it kind of makes sense, but we’ll have to learn how to better distinguish invalid credentials from some sort of limit/bounce from the vSphere API.


#13

@hallback is doing a tremendous job looking into this. Given that many companies (and individuals) have a somewhat significant volume of resources on vSphere, it's good to figure out the reason for this. We'll definitely share our findings with the rest of the community and pursue this deeper.


#14

I think @rick_h is on to something. I remember back in December 2018 that we drove our vCenter admins insane by flooding their management console with invalid logins: our passwords in vCenter had been changed, but we didn't upload the new ones to the controller.

@pmatulis I filed a bug report here: https://bugs.launchpad.net/juju/+bug/1831244 and tried to keep it much shorter than the first post in this thread.