Waiting for kubelet,kube-proxy to start

help-needed

#1

I'm having trouble scaling my Kubernetes cluster with new worker nodes.
Every node I have added is stuck with the message “Waiting for kubelet,kube-proxy to start” and has been in status waiting for hours now.

I’m using Juju 2.6.8 with the vSphere provider.
I have updated all charms in the model to the latest version.

App                    Version  Status       Scale  Charm                  Store       Rev  OS      Notes
containerd                      active           5  containerd             jujucharms   20  ubuntu
easyrsa                3.0.1    active           1  easyrsa                jujucharms  270  ubuntu
etcd                   3.2.10   active           3  etcd                   jujucharms  434  ubuntu
flannel                0.10.0   active           7  flannel                jujucharms  438  ubuntu
kubeapi-load-balancer  1.14.0   maintenance      1  kubeapi-load-balancer  jujucharms  649  ubuntu  exposed
kubernetes-master      1.15.3   active           2  kubernetes-master      jujucharms  724  ubuntu
kubernetes-worker      1.15.3   waiting          5  kubernetes-worker      jujucharms  571  ubuntu  exposed
vsphere-integrator              active           1  vsphere-integrator     jujucharms    2  ubuntu

The status log from one of the new units says:

Time                   Type       Status       Message
08 Sep 2019 13:18:51Z  juju-unit  executing    running container-runtime-relation-joined hook
08 Sep 2019 13:19:07Z  workload   maintenance  Unpacking cni resource.
08 Sep 2019 13:19:12Z  juju-unit  executing    running kube-api-endpoint-relation-changed hook
08 Sep 2019 13:19:33Z  juju-unit  executing    running cni-relation-joined hook
08 Sep 2019 13:19:49Z  juju-unit  executing    running kube-control-relation-joined hook
08 Sep 2019 13:20:05Z  juju-unit  executing    running cni-relation-changed hook
08 Sep 2019 13:20:21Z  juju-unit  executing    running certificates-relation-changed hook
08 Sep 2019 13:20:39Z  juju-unit  executing    running cni-relation-changed hook
08 Sep 2019 13:21:01Z  juju-unit  executing    running container-runtime-relation-changed hook
08 Sep 2019 13:21:19Z  juju-unit  executing    running kube-control-relation-changed hook
08 Sep 2019 13:21:38Z  juju-unit  idle
08 Sep 2019 13:33:01Z  juju-unit  executing    running config-changed hook
08 Sep 2019 13:33:32Z  juju-unit  idle
08 Sep 2019 13:43:29Z  juju-unit  executing    running leader-settings-changed hook
08 Sep 2019 13:43:44Z  juju-unit  idle
08 Sep 2019 13:48:08Z  juju-unit  executing    running config-changed hook
08 Sep 2019 13:48:21Z  workload   waiting      Waiting for kubelet,kube-proxy to start.
08 Sep 2019 13:48:21Z  juju-unit  idle
08 Sep 2019 13:48:52Z  workload   active       Kubernetes worker running.
08 Sep 2019 14:19:36Z  workload   waiting      Waiting for kubelet,kube-proxy to start.

When I SSH into the new machine, no kubelet process is running:

ubuntu@juju-570578-30:~$ ps aux | grep kubelet
ubuntu    3295  0.0  0.0  14856  1076 pts/0    S+   14:21   0:00 grep --color=auto kubelet
ubuntu@juju-570578-30:~$

On an old machine, the same command shows:

ubuntu   11834  0.0  0.0  14856  1008 pts/0    S+   14:24   0:00 grep --color=auto kubelet
root     14571  4.0  0.5 2157668 98508 ?       Ssl  Sep01 443:30 /snap/kubelet/1179/kubelet --cloud-provider=vsphere --config=/root/cdk/kubelet/config.yaml --container-runtime=remote --container-runtime-endpoint=unix:///var/run/containerd/containerd.sock --dynamic-config-dir=/root/cdk/kubelet/dynamic-config --config=/root/cdk/kubelet/config.yaml --kubeconfig=/root/cdk/kubeconfig --logtostderr --network-plugin=cni --node-ip=10.0.50.228 --pod-infra-container-image=image-registry.canonical.com:5000/cdk/pause-amd64:3.1 --provider-id=vsphere://42251125-CA33-13AB-8677-7CB280A16504 --v=0

It seems kubelet doesn't start at all.
The directory /root/cdk/kubelet does not exist on the new machine, and there are no kubelet-related config files anywhere in /root/cdk.

I presume there should be something related to kubelet in this directory:

root@juju-570578-30:~/cdk# ls -la
total 36
drwxrwx--- 2 root root 4096 Sep  8 13:20 .
drwx------ 6 root root 4096 Sep  8 14:28 ..
-r--r----- 1 root root 1172 Sep  8 13:16 ca.crt
-rw-r--r-- 1 root root 4380 Sep  8 13:20 client.crt
-rw-r--r-- 1 root root 1704 Sep  8 13:20 client.key
-rw-r--r-- 1 root root 4594 Sep  8 13:20 server.crt
-rw-r--r-- 1 root root 1704 Sep  8 13:20 server.key
root@juju-570578-30:~/cdk# pwd
/root/cdk

Has anyone else had the same problem?
Daniel


#2

First I would check whether the snap is installed with snap list. Assuming it is installed, I would then check the snap's log with journalctl -u snap.kubelet.daemon.service, and the Juju worker logs with juju debug-log -i kubernetes-worker/x --replay.

With that information we can start to figure out what happened here.
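
The three checks above can be gathered in one pass. This is only a sketch: the function name and output directory are made up for illustration, and the guards are there because the snap and journal checks belong on the worker while the juju one needs a machine with the Juju client. The --no-pager and --no-tail options just keep the commands from waiting for input or following the log.

```shell
#!/usr/bin/env bash
# Collect the checks from this post, one output file per check.
# Each check is skipped when its command is not available here.
collect_worker_checks() {
    local unit="$1"      # e.g. kubernetes-worker/x (replace x with the unit number)
    local outdir="$2"    # hypothetical output directory
    mkdir -p "$outdir"

    # 1. Is the kubelet snap installed?
    if command -v snap >/dev/null 2>&1; then
        snap list > "$outdir/snap-list.txt" 2>&1
    fi
    # 2. What did the kubelet service log?
    if command -v journalctl >/dev/null 2>&1; then
        journalctl -u snap.kubelet.daemon.service --no-pager \
            > "$outdir/kubelet-journal.txt" 2>&1
    fi
    # 3. What did the charm log for this unit?
    if command -v juju >/dev/null 2>&1; then
        juju debug-log -i "$unit" --replay --no-tail \
            > "$outdir/juju-debug-log.txt" 2>&1
    fi
    ls "$outdir"
}
```

For example, collect_worker_checks kubernetes-worker/x ./worker-checks writes whatever it could collect into ./worker-checks and lists the files.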


#3

The snap is installed:

ubuntu@juju-570578-31:~$ sudo snap list|grep kubelet
kubelet     1.15.3   1179  1.15      canonical*  classic

#4

The systemd service crashes with this log:

Sep 08 15:48:30 juju-570578-31 systemd[1]: Started Service for snap application kubelet.daemon.
Sep 08 15:48:30 juju-570578-31 kubelet.daemon[27014]: cat: /var/snap/kubelet/1179/args: No such file or directory
Sep 08 15:48:34 juju-570578-31 kubelet.daemon[27014]: I0908 15:48:34.184897   27014 server.go:425] Version: v1.15.3
Sep 08 15:48:34 juju-570578-31 kubelet.daemon[27014]: I0908 15:48:34.185292   27014 plugins.go:103] No cloud provider specified.
Sep 08 15:48:34 juju-570578-31 kubelet.daemon[27014]: W0908 15:48:34.185316   27014 server.go:564] standalone mode, no API client
Sep 08 15:48:34 juju-570578-31 kubelet.daemon[27014]: W0908 15:48:34.224276   27014 server.go:482] No api server defined - no events will be sent to API server.
Sep 08 15:48:34 juju-570578-31 kubelet.daemon[27014]: I0908 15:48:34.224329   27014 server.go:661] --cgroups-per-qos enabled, but --cgroup-root was not specified.  defaulting to /
Sep 08 15:48:34 juju-570578-31 kubelet.daemon[27014]: I0908 15:48:34.224879   27014 container_manager_linux.go:270] container manager verified user specified cgroup-root exists: []
Sep 08 15:48:34 juju-570578-31 kubelet.daemon[27014]: I0908 15:48:34.224905   27014 container_manager_linux.go:275] Creating Container Manager object based on Node Config: {RuntimeCgroupsName: SystemCgroupsName: Kube
Sep 08 15:48:34 juju-570578-31 kubelet.daemon[27014]: I0908 15:48:34.225067   27014 container_manager_linux.go:295] Creating device plugin manager: true
Sep 08 15:48:34 juju-570578-31 kubelet.daemon[27014]: I0908 15:48:34.225214   27014 state_mem.go:36] [cpumanager] initializing new in-memory state store
Sep 08 15:48:34 juju-570578-31 kubelet.daemon[27014]: I0908 15:48:34.244406   27014 client.go:75] Connecting to docker on unix:///var/run/docker.sock
Sep 08 15:48:34 juju-570578-31 kubelet.daemon[27014]: I0908 15:48:34.244460   27014 client.go:104] Start docker client with request timeout=2m0s
Sep 08 15:48:34 juju-570578-31 kubelet.daemon[27014]: F0908 15:48:34.248332   27014 server.go:273] failed to run Kubelet: failed to create kubelet: failed to get docker version: Cannot connect to the Docker daemon at
Sep 08 15:48:34 juju-570578-31 systemd[1]: snap.kubelet.daemon.service: Main process exited, code=exited, status=255/n/a
Sep 08 15:48:34 juju-570578-31 systemd[1]: snap.kubelet.daemon.service: Failed with result 'exit-code'.
Sep 08 15:48:34 juju-570578-31 systemd[1]: snap.kubelet.daemon.service: Service hold-off time over, scheduling restart.
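
The first kubelet.daemon line looks like the important one: the args file under /var/snap/kubelet/1179 is missing. The lines after it (no cloud provider, standalone mode, connecting to docker) are at least consistent with kubelet starting without any of the flags visible on the old machine. A small helper to spot that line in a journal, as a sketch (the function name is hypothetical):

```shell
# Read kubelet journal text on stdin; succeed if it contains the
# "cat: .../args: No such file or directory" line seen in the log above.
kubelet_args_missing() {
    grep -Eq 'cat: /var/snap/kubelet/[^/]+/args: No such file or directory'
}
```

It can be fed from journalctl -u snap.kubelet.daemon.service --no-pager, and ls -l /var/snap/kubelet/1179/args on the node checks for the file directly.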

#5

I have added the output from juju debug-log -i kubernetes-worker/x --replay here

I don't see any obvious errors other than that the config files and directories for kubelet have somehow not been created.


#6

@ellensen I see you have containerd in your deployment. Can you confirm that containerd is related to both kubernetes-worker and kubernetes-master?
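
One way to check this is to look at the relations section of juju status --relations. The helper below is a rough sketch with a hypothetical name; it assumes each relation is printed on one line with endpoints written as application:endpoint, so adjust it if your output looks different.

```shell
# Read relation lines on stdin; succeed if any single line has an
# endpoint of application $1 and an endpoint of application $2.
has_relation() {
    local a="$1" b="$2"
    awk -v a="$a" -v b="$b" '
        {
            fa = 0; fb = 0
            for (i = 1; i <= NF; i++) {
                # An endpoint field starts with "<application>:".
                if (index($i, a ":") == 1) fa = 1
                if (index($i, b ":") == 1) fb = 1
            }
            if (fa && fb) found = 1
        }
        END { exit found ? 0 : 1 }
    '
}
```

For example, juju status --relations | has_relation containerd kubernetes-master succeeds only if such a line exists; repeat with kubernetes-worker for the workers.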


#7

@tvansteenburgh containerd was only related to the workers. Should it be related to the masters as well?
This is a pre-containerd cluster, created on Kubernetes 1.14-something and then upgraded to 1.15, with containerd added manually later.


#8

@ellensen Yep, see the upgrade notes here for reference: Upgrade notes | Charmed Distribution of Kubernetes documentation | Ubuntu


#9

I followed the upgrade notes:

  • added docker,
  • added relations to workers and masters, then removed docker again to clean it up,
  • then added containerd relations to workers and nodes.

Still no change, though: the kubelet config and files under /root/cdk/kubelet are not created, and kubelet is not running on the node.
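
For reference, the steps listed above can be sketched as Juju commands. This follows the order in this post rather than the official upgrade notes, and it makes some assumptions: the charm URL for docker, "nodes" meaning the masters, and the run and switch_runtime_relations helper names. Setting DRY_RUN=1 only prints the commands. A relation that already exists makes add-relation fail, which this sketch does not handle.

```shell
# Print each command; execute it only when DRY_RUN is unset or empty.
run() {
    echo "+ $*"
    if [ -z "${DRY_RUN:-}" ]; then "$@"; fi
}

switch_runtime_relations() {
    # Temporarily add docker and relate it to masters and workers.
    run juju deploy 'cs:~containers/docker'    # charm URL is an assumption
    run juju add-relation docker kubernetes-master
    run juju add-relation docker kubernetes-worker
    # Remove docker again to clean it up.
    run juju remove-application docker
    # Relate containerd to both masters and workers.
    run juju add-relation containerd kubernetes-master
    run juju add-relation containerd kubernetes-worker
}
```

DRY_RUN=1 switch_runtime_relations shows the six commands without touching the model.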

I tried to remove the containerd application, but all nodes now fail with "containerd/23 error idle 10.0.50.246 hook failed: “containerd-relation-departed”" instead :frowning:

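A failed hook leaves the unit in an error state until it is marked resolved. A minimal sketch of how one might inspect and then clear it (the function names are hypothetical, and whether to retry or skip the hook depends on what the log shows):

```shell
# Show recent status history and the tail of the unit's log.
inspect_failed_unit() {
    local unit="$1"                      # e.g. containerd/23
    juju show-status-log "$unit"
    juju debug-log -i "$unit" --replay --no-tail | tail -n 50
}

# Mark the unit resolved; Juju re-runs the failed hook by default.
retry_failed_hook() { juju resolved "$1"; }

# Mark the unit resolved WITHOUT re-running the failed hook.
skip_failed_hook() { juju resolved --no-retry "$1"; }
```

For example, inspect_failed_unit containerd/23 followed by retry_failed_hook containerd/23 once the cause is understood.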

#10

At this point I think we need a debug tarball in order to diagnose. Please file a bug at https://bugs.launchpad.net/charm-kubernetes-worker/+filebug and be sure to follow the juju-crashdump instructions under the “Kubernetes Worker Charm bug reporting guidelines” section.
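
As a rough sketch of the crashdump step, assuming the juju-crashdump tool is already installed: the flags below are recalled from the troubleshooting guidelines rather than quoted from them, so check them against juju-crashdump --help and the linked instructions. The make_crashdump name and the model argument are placeholders.

```shell
# Build a small crashdump for one model, including the debug-layer
# and config addons. Flags are an assumption; verify before use.
make_crashdump() {
    local model="$1"     # placeholder, e.g. controller:model
    juju-crashdump -s -m "$model" -a debug-layer -a config
}
```

The resulting tarball is what gets attached to the Launchpad bug.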