Waiting for kubelet,kube-proxy to start

I'm having trouble scaling my Kubernetes cluster with new worker nodes.
All the nodes I have added are stuck with the message “Waiting for kubelet,kube-proxy to start” and status “waiting” for hours now.

I’m using Juju 2.6.8 and the vSphere provider.
I have updated all charms in the model to the latest version.

App                    Version  Status       Scale  Charm                  Store       Rev  OS      Notes
containerd                      active           5  containerd             jujucharms   20  ubuntu
easyrsa                3.0.1    active           1  easyrsa                jujucharms  270  ubuntu
etcd                   3.2.10   active           3  etcd                   jujucharms  434  ubuntu
flannel                0.10.0   active           7  flannel                jujucharms  438  ubuntu
kubeapi-load-balancer  1.14.0   maintenance      1  kubeapi-load-balancer  jujucharms  649  ubuntu  exposed
kubernetes-master      1.15.3   active           2  kubernetes-master      jujucharms  724  ubuntu
kubernetes-worker      1.15.3   waiting          5  kubernetes-worker      jujucharms  571  ubuntu  exposed
vsphere-integrator              active           1  vsphere-integrator     jujucharms    2  ubuntu

The status log from the unit (juju show-status-log) shows:

Time                   Type       Status       Message
08 Sep 2019 13:18:51Z  juju-unit  executing    running container-runtime-relation-joined hook
08 Sep 2019 13:19:07Z  workload   maintenance  Unpacking cni resource.
08 Sep 2019 13:19:12Z  juju-unit  executing    running kube-api-endpoint-relation-changed hook
08 Sep 2019 13:19:33Z  juju-unit  executing    running cni-relation-joined hook
08 Sep 2019 13:19:49Z  juju-unit  executing    running kube-control-relation-joined hook
08 Sep 2019 13:20:05Z  juju-unit  executing    running cni-relation-changed hook
08 Sep 2019 13:20:21Z  juju-unit  executing    running certificates-relation-changed hook
08 Sep 2019 13:20:39Z  juju-unit  executing    running cni-relation-changed hook
08 Sep 2019 13:21:01Z  juju-unit  executing    running container-runtime-relation-changed hook
08 Sep 2019 13:21:19Z  juju-unit  executing    running kube-control-relation-changed hook
08 Sep 2019 13:21:38Z  juju-unit  idle
08 Sep 2019 13:33:01Z  juju-unit  executing    running config-changed hook
08 Sep 2019 13:33:32Z  juju-unit  idle
08 Sep 2019 13:43:29Z  juju-unit  executing    running leader-settings-changed hook
08 Sep 2019 13:43:44Z  juju-unit  idle
08 Sep 2019 13:48:08Z  juju-unit  executing    running config-changed hook
08 Sep 2019 13:48:21Z  workload   waiting      Waiting for kubelet,kube-proxy to start.
08 Sep 2019 13:48:21Z  juju-unit  idle
08 Sep 2019 13:48:52Z  workload   active       Kubernetes worker running.
08 Sep 2019 14:19:36Z  workload   waiting      Waiting for kubelet,kube-proxy to start.

When I SSH into the new machine, I see:

ubuntu@juju-570578-30:~$ ps aux | grep kubelet
ubuntu    3295  0.0  0.0  14856  1076 pts/0    S+   14:21   0:00 grep --color=auto kubelet
ubuntu@juju-570578-30:~$

On an old (working) machine I see:

ubuntu   11834  0.0  0.0  14856  1008 pts/0    S+   14:24   0:00 grep --color=auto kubelet
root     14571  4.0  0.5 2157668 98508 ?       Ssl  Sep01 443:30 /snap/kubelet/1179/kubelet --cloud-provider=vsphere --config=/root/cdk/kubelet/config.yaml --container-runtime=remote --container-runtime-endpoint=unix:///var/run/containerd/containerd.sock --dynamic-config-dir=/root/cdk/kubelet/dynamic-config --config=/root/cdk/kubelet/config.yaml --kubeconfig=/root/cdk/kubeconfig --logtostderr --network-plugin=cni --node-ip=10.0.50.228 --pod-infra-container-image=image-registry.canonical.com:5000/cdk/pause-amd64:3.1 --provider-id=vsphere://42251125-CA33-13AB-8677-7CB280A16504 --v=0

It seems kubelet doesn't start at all… The folder /root/cdk/kubelet does not exist on the machine, and no kubelet-related config files can be found in /root/cdk.

I presume there should have been something related to kubelet in this directory.

root@juju-570578-30:~/cdk# ls -la
total 36
drwxrwx--- 2 root root 4096 Sep  8 13:20 .
drwx------ 6 root root 4096 Sep  8 14:28 ..
-r--r----- 1 root root 1172 Sep  8 13:16 ca.crt
-rw-r--r-- 1 root root 4380 Sep  8 13:20 client.crt
-rw-r--r-- 1 root root 1704 Sep  8 13:20 client.key
-rw-r--r-- 1 root root 4594 Sep  8 13:20 server.crt
-rw-r--r-- 1 root root 1704 Sep  8 13:20 server.key
root@juju-570578-30:~/cdk# pwd
/root/cdk
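
For reference, the kubelet command line on the old machine (above) points at /root/cdk/kubelet/config.yaml and /root/cdk/kubeconfig; neither path exists on this new node, as a quick check confirms:

ls -la /root/cdk/kubelet /root/cdk/kubeconfig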

Has anyone had the same problem?
Daniel

First I would check whether the snap is installed with snap list, and then, assuming it is installed, I would check the log files for the snap and the Juju worker logs.
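
Concretely (with x replaced by the stuck unit's number):

snap list
journalctl -u snap.kubelet.daemon.service
juju debug-log -i kubernetes-worker/x --replay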

With that information we can start to figure out what happened here.

The snap is installed:

ubuntu@juju-570578-31:~$ sudo snap list|grep kubelet
kubelet     1.15.3   1179  1.15      canonical*  classic

The systemd service crashes with this log:

Sep 08 15:48:30 juju-570578-31 systemd[1]: Started Service for snap application kubelet.daemon.
Sep 08 15:48:30 juju-570578-31 kubelet.daemon[27014]: cat: /var/snap/kubelet/1179/args: No such file or directory
Sep 08 15:48:34 juju-570578-31 kubelet.daemon[27014]: I0908 15:48:34.184897   27014 server.go:425] Version: v1.15.3
Sep 08 15:48:34 juju-570578-31 kubelet.daemon[27014]: I0908 15:48:34.185292   27014 plugins.go:103] No cloud provider specified.
Sep 08 15:48:34 juju-570578-31 kubelet.daemon[27014]: W0908 15:48:34.185316   27014 server.go:564] standalone mode, no API client
Sep 08 15:48:34 juju-570578-31 kubelet.daemon[27014]: W0908 15:48:34.224276   27014 server.go:482] No api server defined - no events will be sent to API server.
Sep 08 15:48:34 juju-570578-31 kubelet.daemon[27014]: I0908 15:48:34.224329   27014 server.go:661] --cgroups-per-qos enabled, but --cgroup-root was not specified.  defaulting to /
Sep 08 15:48:34 juju-570578-31 kubelet.daemon[27014]: I0908 15:48:34.224879   27014 container_manager_linux.go:270] container manager verified user specified cgroup-root exists: []
Sep 08 15:48:34 juju-570578-31 kubelet.daemon[27014]: I0908 15:48:34.224905   27014 container_manager_linux.go:275] Creating Container Manager object based on Node Config: {RuntimeCgroupsName: SystemCgroupsName: Kube
Sep 08 15:48:34 juju-570578-31 kubelet.daemon[27014]: I0908 15:48:34.225067   27014 container_manager_linux.go:295] Creating device plugin manager: true
Sep 08 15:48:34 juju-570578-31 kubelet.daemon[27014]: I0908 15:48:34.225214   27014 state_mem.go:36] [cpumanager] initializing new in-memory state store
Sep 08 15:48:34 juju-570578-31 kubelet.daemon[27014]: I0908 15:48:34.244406   27014 client.go:75] Connecting to docker on unix:///var/run/docker.sock
Sep 08 15:48:34 juju-570578-31 kubelet.daemon[27014]: I0908 15:48:34.244460   27014 client.go:104] Start docker client with request timeout=2m0s
Sep 08 15:48:34 juju-570578-31 kubelet.daemon[27014]: F0908 15:48:34.248332   27014 server.go:273] failed to run Kubelet: failed to create kubelet: failed to get docker version: Cannot connect to the Docker daemon at
Sep 08 15:48:34 juju-570578-31 systemd[1]: snap.kubelet.daemon.service: Main process exited, code=exited, status=255/n/a
Sep 08 15:48:34 juju-570578-31 systemd[1]: snap.kubelet.daemon.service: Failed with result 'exit-code'.
Sep 08 15:48:34 juju-570578-31 systemd[1]: snap.kubelet.daemon.service: Service hold-off time over, scheduling restart.
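
The first log line looks like the root cause: the snap wrapper can't find /var/snap/kubelet/1179/args, so kubelet falls back to its built-in defaults (no cloud provider, standalone mode, the Docker runtime) and then crashes because there is no Docker daemon on a containerd node. Presumably the charm should have rendered that args file; it can be checked via the snap's current symlink:

sudo cat /var/snap/kubelet/current/args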

I have added the output from juju debug-log -i kubernetes-worker/x --replay here

I don't see any obvious errors other than that the config files and directories for kubelet have somehow not been created.

@ellensen I see you have containerd in your deployment. Can you confirm that containerd is related to both kubernetes-worker and kubernetes-master?

@tvansteenburgh containerd was only related to the workers. Should it be related to the masters as well?
This cluster is a pre-containerd cluster, created on Kubernetes 1.14-something and then upgraded to 1.15 and containerd manually later.

@ellensen Yep, see the upgrade notes here for reference: Upgrade notes | Ubuntu
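
If only the relation is missing, adding it should presumably be as simple as (using the application names from juju status above):

juju add-relation containerd kubernetes-master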

I followed the upgrade notes (see the sketch of the commands after the list):

  • added docker,
  • added its relations to the workers and masters, then removed it to clean up Docker,
  • then added the containerd relations to the workers and masters.
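
A sketch of what that amounts to, assuming the charm-store names (the authoritative sequence is in the linked upgrade notes):

juju deploy cs:~containers/docker
juju add-relation docker kubernetes-master
juju add-relation docker kubernetes-worker
# removing the application departs the relations and triggers the docker cleanup
juju remove-application docker
juju add-relation containerd kubernetes-master
juju add-relation containerd kubernetes-worker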

But still no change: the kubelet config and files under /root/cdk/kubelet are not created, and kubelet is not running on the node.

I tried to remove the containerd application, but all nodes now fail with “containerd/23 error idle 10.0.50.246 hook failed: containerd-relation-departed” instead :frowning:

At this point I think we need a debug tarball in order to diagnose. Please file a bug at https://bugs.launchpad.net/charm-kubernetes-worker/+filebug and be sure to follow the juju-crashdump instructions under the “Kubernetes Worker Charm bug reporting guidelines” section.
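
For the tarball itself, the usual route is the juju-crashdump snap (an assumption here; follow the bug-reporting guidelines if they differ):

sudo snap install juju-crashdump --classic
juju-crashdump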

Sorry for the lack of feedback. I think the problem I had came from a host with network issues: one of the ESX hosts had a very unstable network connection. I tried to deploy new nodes from scratch, but they also stopped on various network “connection refused” messages… Then I noticed that one new node actually managed to deploy and initialize quickly, and that node had been provisioned on the healthy ESX host. I disabled the problematic ESX host and migrated all the VMs to the working one, and it has been working ever since.
