Blocker deploying CDK on manual deployment

Hi folks

This relates to my other WIP post, but it's blocked me for 24 hours and hosed a working K8S cluster, so I figure it's worth splitting out.

I’ve got 2 servers, one as the K8S master and one as a K8S worker.

I’ve wired up manual CDK installs before, but when I ran a charm upgrade on my working one the same thing happened:

On the master, the API server fails to start; it tries to query:

http://127.0.0.1:8080/api/v1/namespaces/kube-system/endpoints/kube-controller-manager? 

but it just times out and throws a crapload of warnings.
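The quickest way I've found to check whether the apiserver is actually answering (this assumes the insecure port the charm configures, 8080, as shown in the config further down):

# from the master: does anything answer on the insecure port?
curl -sS --max-time 5 http://127.0.0.1:8080/healthz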

On the worker:

Feb 11 12:52:03 ubuntu kube-proxy.daemon[12420]: cat: /var/snap/kube-proxy/722/args: No such file or directory
Feb 11 12:52:03 ubuntu kube-proxy.daemon[12420]: W0211 12:52:03.493741   12420 server.go:194] WARNING: all flags other than --config, --write-config-to, and --cleanup are deprecated. Please begin using a config file ASAP.
Feb 11 12:52:03 ubuntu kube-proxy.daemon[12420]: I0211 12:52:03.500258   12420 server.go:429] Neither kubeconfig file nor master URL was specified. Falling back to in-cluster config.
Feb 11 12:52:03 ubuntu kube-proxy.daemon[12420]: F0211 12:52:03.500271 12420 server.go:377] unable to load in-cluster configuration, KUBERNETES_SERVICE_HOST and KUBERNETES_SERVICE_PORT must be defined

and fails to start.

I’m not sure which is the problem, but if I look at a working CDK I deployed into AWS, I see

/root/cdk/kubelet/config.yaml

which doesn’t exist on mine. The args file/folder from the error also definitely doesn’t exist in the snap, but I’m not sure what’s supposed to create it.
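For reference, these are the checks I was doing on the worker (paths taken from the working AWS deploy and the log above; the snap revision number will differ per install):

# does the rendered kubelet config exist?
ls -l /root/cdk/kubelet/config.yaml

# does the kube-proxy args file the error complains about exist?
ls -l /var/snap/kube-proxy/current/args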

On the master the syslog also says:

Feb 11 12:52:19 ubuntu kube-apiserver.daemon[5177]: Error: error creating self-signed certificates: mkdir /var/run/kubernetes: permission denied

right back at the top, but that folder doesn’t exist on the working cluster either, so I’m not convinced it’s a disaster.

At first glance, the cert error seems like a red herring due to it trying to start up with default config due to the config file not having been written. The fact that it tried to start at all without having written the config file is an issue, but it’s not the underlying problem.

I saw something that presented similarly last week which was due to Flannel not being able to talk to etcd due to issues with the fan network, but there are any number of reasons why it might do that. First step to debug, I think, would be to juju run --unit kubernetes-master/0 -- charms.reactive get_flags and compare that to the flags required to run start_master for that revision of master.
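Concretely, something along these lines (the handler path assumes the standard Juju agent layout on the unit and the usual reactive filename for that charm):

# dump the reactive flags currently set on the master unit
juju run --unit kubernetes-master/0 -- charms.reactive get_flags

# then, on the master host, compare against the @when preconditions on
# start_master in the deployed charm revision
grep -B8 'def start_master' \
    /var/lib/juju/agents/unit-kubernetes-master-0/charm/reactive/kubernetes_master.py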

Of note: the revision of the charms you’re using seems to be stable, which doesn’t seem to have been updated since 2018-12-10T19:38:54.101Z, so I’m not sure why it would have suddenly stopped working when it did previously. Perhaps something changed in the environment to which you’re deploying that uncovered a latent bug.

Thanks @cory_fu. Here’s what get_flags gives me on the master:

apt.installed.socat
apt.installed.vaultlocker
authentication.setup
cdk-service-kicker.installed
certificates.available
certificates.batch.cert.available
certificates.ca.available
certificates.certs.available
certificates.client.cert.available
certificates.server.cert.available
certificates.server.certs.available
client.password.initialised
cni.available
cni.configured
cni.connected
endpoint.certificates.changed
endpoint.certificates.changed.ca
endpoint.certificates.changed.client.cert
endpoint.certificates.changed.client.key
endpoint.certificates.changed.egress-subnets
endpoint.certificates.changed.ingress-address
endpoint.certificates.changed.kubernetes-master_0.server.cert
endpoint.certificates.changed.kubernetes-master_0.server.key
endpoint.certificates.changed.private-address
endpoint.certificates.joined
endpoint.cni.changed.cidr
endpoint.cni.changed.egress-subnets
endpoint.cni.changed.ingress-address
endpoint.cni.changed.private-address
endpoint.cni.joined
endpoint.kube-api-endpoint.changed
endpoint.kube-api-endpoint.changed.egress-subnets
endpoint.kube-api-endpoint.changed.ingress-address
endpoint.kube-api-endpoint.changed.private-address
endpoint.kube-api-endpoint.joined
etcd.available
etcd.connected
etcd.tls.available
kube-api-endpoint.available
kube-control.connected
kubernetes-master.cluster-tag-sent
kubernetes-master.components.started
kubernetes-master.privileged
kubernetes-master.snaps.installed
leadership.is_leader
leadership.set./root/cdk/basic_auth.csv
leadership.set./root/cdk/known_tokens.csv
leadership.set./root/cdk/serviceaccount.key
leadership.set.auto_storage_backend
leadership.set.cluster_tag
leadership.set.snapd_refresh
snap.installed.cdk-addons
snap.installed.core
snap.installed.kube-apiserver
snap.installed.kube-controller-manager
snap.installed.kube-proxy
snap.installed.kube-scheduler
snap.installed.kubectl
snap.refresh.set
tls_client.ca.saved
tls_client.ca.written
tls_client.ca_installed
tls_client.client.certificate.saved
tls_client.client.certificate.written
tls_client.client.key.saved
tls_client.server.certificate.saved
tls_client.server.key.saved

Looks all up and running to me.

Also, walking through that code, it seems to then hit configure_kubernetes_service,

and snap get kube-apiserver seems to be set:

root@kubernetes-master1:~# snap get kube-apiserver
Key                                 Value
admission-control                   NamespaceLifecycle,LimitRanger,ServiceAccount,PersistentVolumeLabel,DefaultStorageClass,DefaultTolerationSeconds,MutatingAdmissionWebhook,ValidatingAdmissionWebhook,ResourceQuota
advertise-address                   10.0.0.15
allow-privileged                    true
audit-log-maxbackup                 9
audit-log-maxsize                   100
audit-log-path                      /root/cdk/audit/audit.log
audit-policy-file                   /root/cdk/audit/audit-policy.yaml
authorization-mode                  AlwaysAllow
basic-auth-file                     /root/cdk/basic_auth.csv
client-ca-file                      /root/cdk/ca.crt
enable-aggregator-routing           true
etcd-cafile                         /root/cdk/etcd/client-ca.pem
etcd-certfile                       /root/cdk/etcd/client-cert.pem
etcd-keyfile                        /root/cdk/etcd/client-key.pem
etcd-servers                        https://10.0.0.15:2379
insecure-bind-address               127.0.0.1
insecure-port                       8080
kubelet-certificate-authority       /root/cdk/ca.crt
kubelet-client-certificate          /root/cdk/client.crt
kubelet-client-key                  /root/cdk/client.key
kubelet-preferred-address-types     [InternalIP,Hostname,InternalDNS,ExternalDNS,ExternalIP]
logtostderr                         true
min-request-timeout                 300
proxy-client-cert-file              /root/cdk/client.crt
proxy-client-key-file               /root/cdk/client.key
requestheader-allowed-names         client
requestheader-client-ca-file        /root/cdk/ca.crt
requestheader-extra-headers-prefix  X-Remote-Extra-
requestheader-group-headers         X-Remote-Group
requestheader-username-headers      X-Remote-User
service-account-key-file            /root/cdk/serviceaccount.key
service-cluster-ip-range            10.152.183.0/24
storage-backend                     etcd3
tls-cert-file                       /root/cdk/server.crt
tls-private-key-file                /root/cdk/server.key
token-auth-file                     /root/cdk/known_tokens.csv
v                                   4

I misread the startup missing config errors as also being from the master, but we actually need to be looking at the flags on the worker and start_worker.
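That is, the same check as before but against the worker (unit name assumes the default application name):

juju run --unit kubernetes-worker/0 -- charms.reactive get_flags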

Hmmm, thanks @cory_fu, sorry for the delay, had dad duty…

So, kube-control.dns.available isn’t set. But then I read somewhere that on a single node it’s not set on the first boot because the DNS is internal or something…

Oh yeah, 330 of the same file.

I don’t even see where that state is set; I’m missing something somewhere.

Oh, so the DNS comes in from the master, according to the relation doc (I think).
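To see what the master is actually publishing on that relation, something like this from the worker (the relation id is a guess, so check relation-ids first):

# find the kube-control relation id on the worker
juju run --unit kubernetes-worker/0 -- relation-ids kube-control

# dump what the master publishes on it (substitute the id from above)
juju run --unit kubernetes-worker/0 -- relation-get -r kube-control:0 - kubernetes-master/0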

So again we’re back to why the timeouts occur on the kube-apiserver.daemon…

I’ve also set flannel to bind to the internal network interface, and etcd is set to bind to all.
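Roughly what I set, for reference (eth1 is just a stand-in for my internal NIC; iface is the flannel charm’s option):

# bind the flannel overlay to the internal NIC
juju config flannel iface=eth1

# list etcd's options to see how it binds
juju config etcd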

Aaah, crapola: etcd out of the box is v2 but the bundle deploys v3. Didn’t catch that. Okay, redeploy for the 600th time…

Hurrah, mystery solved.

So etcd currently installs 2.x by default, but the K8S charms only work with v3. The bundle pins the snap version, but as I was hand-cranking it I wasn’t picking up the version difference.
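For anyone else hand-cranking this with the etcd charm: the bundle pins the snap channel via the charm’s channel option, so something like this gets you onto v3 (3.2/stable is just an example, use whatever the bundle you’re copying pins):

juju config etcd channel=3.2/stable

# confirm which version the unit actually installed
juju run --unit etcd/0 -- snap list etcd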

Thanks for the help @cory_fu

Glad you figured it out. Seems like the k8s charms should detect and report that better.