Controller for vsphere-cloud only uses ESXi host it was bootstrapped on

help-needed
vsphere

#1

Hello,

I admit it is not a typical VMware setup, but I believe it is still a valid configuration. The current POC environment has 2 ESXi hosts with local storage, running as separate hosts within the datacenter01 container. The end plan is to have 2 large ESXi hosts with local storage providing it to the cluster, and one smaller ESXi host running the cluster management components to ensure a cluster “majority rule”. Kubernetes deployment via Ansible and Kubespray worked on top of that, but it was decided to migrate to a Juju deployment to simplify tooling for the maintenance and support teams.

The Juju deployment is working happily in another environment where shared SAN storage is available, but not with local-only storage.

I have successfully added the vSphere cloud and bootstrapped the Juju controller after explicitly specifying the datastore for the controller via the --config datastore="srvt027_Local-01" option. Without it, the bootstrap process failed with the error: cannot start bootstrap instance: creating template VM: no datastore provided and multiple available: "srvt027_Local-01, srvt028_Local-01". So Juju is clearly able to see both nodes and their configuration.

I was able to start the Kubernetes deployment, however Juju only deployed VMs to the srvt027 host. If I bootstrap the controller on the srvt028 host, then everything is started on that host.

I have tried the following (with the controller bootstrapped on srvt028, IP x.y.z.25):

  • Specify constraints: cores=2 mem=4G root-disk=16G root-disk-source=srvt027_Local-01 in the Kubernetes bundle.yaml.
    • Juju happily ignored the root-disk-source constraint and deployed everything on srvt028.
  • juju add-machine zone="x.y.z.24"
    • Juju reports the error: “suitable availability zone for machine 14 not found”
  • juju add-machine zone="x.y.z.24" --constraints root-disk-source="srvt027_Local-01"
    • Juju reports the error: “suitable availability zone for machine 15 not found”
  • juju add-machine --constraints root-disk-source="srvt027_Local-01"
    • Juju happily ignored the root-disk-source constraint and deployed the machine on srvt028.
  • Add both nodes to a VMware cluster object. No effect.
  • Migrate the controller to another host. No effect; VMs are created on the same host the controller was bootstrapped on.
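For reference, the constraints from the first attempt were set in the bundle along these lines (a minimal sketch only; the application name and charm below are placeholders, not from my actual bundle — only the constraints string is the one I used):

```yaml
# Hypothetical bundle.yaml fragment; application/charm names are placeholders.
applications:
  kubernetes-worker:
    charm: cs:kubernetes-worker
    num_units: 2
    # root-disk-source was silently ignored here and all units
    # landed on the host the controller was bootstrapped on.
    constraints: cores=2 mem=4G root-disk=16G root-disk-source=srvt027_Local-01
```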

Questions:

  • What is the proper syntax to define explicit hosts for a vSphere cloud? The documentation for the vSphere role is a bit vague. What are mycluster, mygroup and myparent in vSphere terms?

juju deploy myapp --constraints zones=mycluster/mygroup
juju deploy myapp --constraints zones=mycluster/myparent/mygroup

  • Has anyone used Juju properly with ESXi hosts without any shared storage? Is this something Juju is unable to handle by design?

#2

Juju manages instances via the vSphere API. It does not work at the ESXi level. If vSAN support is not enabled, I don’t believe that vSphere would enable us to know whether a local disk is actually shared between multiple vSphere datacenters.


I believe that a “vSphere cloud” in Juju limits itself to deploying VMs to a single datastore. For technical reasons, Juju uses a template VM that is cloned when new instances are requested. That is why moving the controller doesn’t affect where new instances are created.

One option available to you is to create a second cloud. Applications can still communicate across the two clouds via cross-model relations (CMR).
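For illustration, a CMR workflow looks roughly like the sketch below (model and application names here are hypothetical, not taken from your setup):

```shell
# In a model on the first cloud: publish an application endpoint as an offer.
# "mysql" and its "db" endpoint are hypothetical names.
juju offer mysql:db

# In a model on the second cloud: consume the offer, then relate to it
# as if it were a local application.
juju consume admin/model-one.mysql
juju add-relation myapp mysql
```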


This is concerning. It certainly looks like you’ve found a bug. (Possibly in our documentation!) From memory, the root disk constraint doesn’t affect where the instance is deployed to. It affects the size of the “local” disk within the VM.


#3

Hello,

thanks for the reply.

Meaning? If the datastore is not explicitly specified during bootstrap, Juju fails with an error listing all (both) available datastores, so it seems to have access to that information. There is a single datacenter object in vCenter, and it should be possible to select the host for application VM deployment the same way it was done for bootstrap. At least that is what I would expect.

I presume you were referring to bootstrapping, as further application deployments using a charm bundle are deployed to multiple hosts if vSAN is available.

Thanks for the suggestion. I will try that.

In a vSAN-enabled environment it worked; Juju was juggling VMs across the available hosts. Maybe that was some side effect which triggered the allocation.


#4

Hi @elvinas - just checking that I understand your setup: you have a datacenter with two hosts, srvt027 and srvt028, and two datastores, srvt027_Local-01 and srvt028_Local-01. Is that right?

I’m not sure about this, but I don’t think that datastores in vSphere are restricted to VMs on a specific host. You can have a VM running on one host talking to a datastore that happens to be backed by the other host.

So it might be that the commands are honouring the root-disk-source constraint (storing the data for the VM on the correct datastore) even though they’re deploying the VM on the other host. Can you check where the disk lives in the vSphere UI? If it’s on the wrong datastore, that’s definitely a bug.

I think you’re right that the zones examples are confusing. Juju availability zones were originally designed to work with clouds like AWS and Azure, so they don’t line up exactly with vSphere’s hosts, clusters and resource groups (particularly in terms of distributing units across zones).

When using a vSphere cloud, the first component of a zone is the name of a host or cluster. Any names after the first / are resource groups - there might be multiple slashes because resource groups can be nested.

If you’re not trying to use resource groups to partition the hosts, you can just use the host name as the zone name. So to put a machine on srvt027 and its disk on the associated datastore you’d use:

--constraints "zones=srvt027 root-disk-source=srvt027_Local-01"

Can you try that and let us know if it does the right thing?


#5

The only caveat (which I suspect is the key to my problem) is that srvt027 sees only the srvt027_Local-01 datastore and srvt028 can access only srvt028_Local-01, as these are local disk arrays on the ESXi hosts. So I doubt it is possible to have a VM on srvt027 with a datastore on the other host if the datastore is not shared. At least vCenter does not allow migrating the compute resource or storage alone; I can only migrate both to another host.

If I try to specify the host IP as a constraint, deployment fails with “suitable availability zone for machine ** not found”. I did not try specifying the host name, as there is no DNS there yet.

Just tried, by manually adding hostname resolution. Machine deployment only works if I specify the IP address of the host I bootstrapped the Juju controller on.

debadmin@debian-srv:~$ juju add-machine zone=srvt027
failed to create 1 machine
ERROR cannot add a new machine: availability zone "srvt027" not found
debadmin@debian-srv:~$ juju add-machine zone=srvt028
failed to create 1 machine
ERROR cannot add a new machine: availability zone "srvt028" not found
debadmin@debian-srv:~$ juju add-machine zone=x.y.z.24
created machine 18 
debadmin@debian-srv:~$ juju add-machine zone=x.y.z.25
failed to create 1 machine
ERROR cannot add a new machine: availability zone "x.y.z.25" not found

However, even when adding the machine succeeded, it did not start:

18 down pending bionic suitable availability zone for machine 18 not found

Some more attempts:

debadmin@debian-srv:~$ juju add-machine --constraints zones=srvt027
failed to create 1 machine
ERROR cannot add a new machine: availability zone "srvt027" not found
debadmin@debian-srv:~$ juju add-machine --constraints zones=srvt028
failed to create 1 machine
ERROR cannot add a new machine: availability zone "srvt028" not found
debadmin@debian-srv:~$ juju add-machine --constraints zones=x.y.z.25
failed to create 1 machine
ERROR cannot add a new machine: availability zone "x.y.z.25" not found
debadmin@debian-srv:~$ juju add-machine --constraints zones=x.y.z.24
created machine 19
debadmin@debian-srv:~$ juju add-machine --constraints root-disk-source=srvt027_Local-01
created machine 20
debadmin@debian-srv:~$ juju add-machine --constraints root-disk-source=srvt028_Local-01
created machine 21
debadmin@debian-srv:~$ juju status
Model       Controller      Cloud/Region        Version    SLA          Timestamp
kubernetes  poc-controller  VIPOC/datacenter01  2.7-beta1  unsupported  08:42:32+02:00
...
20       pending       juju-1d2fa3-20  bionic      poweredOn
21       pending       juju-1d2fa3-21  bionic      poweredOn

However, both machines 20 and 21 were created on the srvt028 host with srvt028_Local-01.


#6

Hm… I have read the documentation and I wonder how that would work.

A vSphere cloud is linked to a vCenter host and a datacenter object, not a specific ESXi host. So adding an additional cloud definition with the same vCenter host does not change the situation if I use the same Juju controller. If I add a new controller, then I would still be limited to one application per controller, as per CMR scenario 2. That would not be the intended way to deploy an app, as I would lose resiliency in case of host failure.

Or am I missing something?


#7

Ok - it sounds like I was wrong about this, sorry. The vSphere instance I have access to for testing only has one host, so I wasn’t able to try it.

What are the hosts named in the vSphere UI? When finding the resource group to create a VM in, we match the name of the host in the inventory, not the IP address or the DNS name (unless that happens to also be the inventory name). So I wouldn’t expect the commands specifying zones=x.y.z.24 to do what you want.

In general, you should be able to create machines on both hosts with only one controller, no cross-model relations needed.


#8

Hosts in vSphere are named by their IP addresses. They are now part of the cluster Cluster01, but previously there was no cluster and specifying zones="IP_ADDRESS" did not work. Could it be that the “.” characters in the IP address are interpreted as something other than a literal symbol when matching the name?

Will try to reconfigure that when I find the login for the DNS VM running in the same vCenter. :slight_smile:


#9

Hi @elvinas - you were right, we weren’t respecting the root-disk-source constraint when creating VMs.

I’ve put up a PR to fix it - it should be released in 2.6.10 very soon.


#10

Thanks. Maybe it is already available in some 2.7 beta? We have another issue with 2.6, as it creates multiple network interfaces and then we get another set of problems. If I am not mistaken this was the issue: Bug #1800940 “Juju bootstrap vmware vsphere not working with vsa...” : Bugs : juju

Although I am not sure that will solve my problem. Somehow I suspect that Juju will not find the host once it starts respecting root-disk-source :slight_smile:


#11

Hi @babbageclunk, was the fix applied to version 2.7-beta1? Today I tried to redeploy the environment and first failed due to the wrong “VM Network” being selected instead of the distributed switch one. So I rebuilt everything, including the controller. The version is still the same 2.7-beta1, but now “juju status” shows the same errors as bootstrapping the controller did when root-disk-source is not specified.

As soon as I specified root-disk-source for all VMs, shuffling VMs between the two nodes, Juju created them on the expected hosts. So it looks like the issue was fixed in 2.7-beta1.

The strange thing is that Juju now selects the wrong network. I found a bug dating back to 2017 which said “VM Network” was hardcoded somewhere, and that issue was supposed to be fixed.


#12

Hi @elvinas - yes, the root-disk-source fix was merged into the 2.7 branch on Friday so if you were using a new build for testing then it would have been there.

Have you set the primary-network (and possibly external-network) values in the model configuration? If not, Juju might pick the wrong one.
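To rule that out, it may be worth checking and setting the value on the model itself - a sketch, using the network name from your environment:

```shell
# Show the current model's setting.
juju model-config primary-network

# Set it for the current model...
juju model-config primary-network="DSwitch-VM Network"

# ...and/or make it the default for any new models on this controller.
juju model-defaults primary-network="DSwitch-VM Network"
```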


#13

OK, thanks for the confirmation. Although it would be nice if Juju reported a build number or something, so it could be identified that something has changed.

Regarding the primary network: yes, I have specified it, and I see “DSwitch-VM Network” in the bootstrap config file:

debadmin@ep-jumpbox:~$ grep -i netw .local/share/juju/*
.local/share/juju/bootstrap-config.yaml: container-networking-method: ""
.local/share/juju/bootstrap-config.yaml: disable-network-management: false
.local/share/juju/bootstrap-config.yaml: primary-network: DSwitch-VM Network

However, the VMs end up in “VM Network”. Last week it seems to have worked and the application was installing. Now deployment continues as expected when I manually migrate the VMs from “VM Network” to “DSwitch-VM Network”.

This was supposed to be fixed more than a year ago: Bug #1619812 “juju bootstrap node or deploy service with vSphere...” : Bugs : juju

Or do we have a case of:
99 little bugs in the code,
99 bugs in the code,
1 bug fixed…

Compile again,
100 little bugs in the code?
:smiley:


#14

Today I retried with the following Juju versions:

juju (2.6/edge) 2.6.10+2.6-9f8a13f
edge: 2.7-beta1+develop-4819cdc 2019-10-17

Controller is bootstrapped via:

juju bootstrap  VIPOC/datacenter01 poc-controller \
    --config primary-network="DSwitch-VM Network" \
    --config datastore="srvt028_Local-01" 

Behavior is the same. The controller is booted on the correct network, however the VMs end up in “VM Network” instead of “DSwitch-VM Network”.

In addition, deployment with 2.7-beta1+develop-4819cdc stalls with all hosts showing “agent initializing” and the logs full of the following, for each cluster node:

unit-etcd-1: 14:24:09 DEBUG juju.worker.dependency "uniter" manifold worker stopped: subnet "x.y.z.0/24" not found
unit-etcd-1: 14:24:09 ERROR juju.worker.dependency "uniter" manifold worker returned unexpected error: subnet "x.y.z.0/24" not found

An earlier 2.7 build (3xxxxxx) did not have those errors.


#15

Hi @elvinas, sorry I missed your response! I think the problem might be that specifying --config applies to the controller model but not to any new models you create after that, so those fall back to the default network. Could you try bootstrapping but setting the network and datastore with --model-default rather than --config?
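Something along these lines - a sketch based on your earlier bootstrap command, with only the flag changed from --config to --model-default:

```shell
# --model-default makes the values apply to all models on the controller,
# not just the controller model.
juju bootstrap VIPOC/datacenter01 poc-controller \
    --model-default primary-network="DSwitch-VM Network" \
    --model-default datastore="srvt028_Local-01"
```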

The logging about missing subnets from the 2.7 edge version might be due to ongoing work that a couple of other team members are doing in that area - I’ll make sure they know about the problem. In the meantime it’s probably better to use the 2.6/edge snap.