NFS Charm + 1.15 Charmed Kube = NFS Version not supported


#1

Hi folks,

I have downgraded from Kubernetes 1.16 to 1.15 and used the Juju method for installation.

I deployed the NFS charm and related it to the Kubernetes worker. This worked flawlessly on 1.16, but on 1.15 the same setup seems to result in an NFS protocol mismatch. I'm not sure if this is a bug in the NFS charm, or if there is an easy way to resolve it. (I imagine I just need to change the NFS provisioner definition somehow.)

From kubectl get events

LAST SEEN   TYPE      REASON        OBJECT                                        MESSAGE
5m54s       Normal    Scheduled     pod/nfs-client-provisioner-7497897b88-92lm7   Successfully assigned default/nfs-client-provisioner-7497897b88-92lm7 to juju-e52e83-4
5m53s       Warning   FailedMount   pod/nfs-client-provisioner-7497897b88-92lm7   MountVolume.SetUp failed for volume "nfs-client-root" : mount failed: exit status 32
Mounting command: systemd-run
Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/68f581eb-2895-44ea-b902-b20ed3bdaef7/volumes/kubernetes.io~nfs/nfs-client-root --scope -- mount -t nfs 192.168.54.78:/srv/data/kubernetes-worker /var/lib/kubelet/pods/68f581eb-2895-44ea-b902-b20ed3bdaef7/volumes/kubernetes.io~nfs/nfs-client-root
Output: Running scope as unit: run-r911500b6fcbe43ed90b0da03b909b2d0.scope
mount.nfs: requested NFS version or transport protocol is not supported
5m53s       Warning   FailedMount   pod/nfs-client-provisioner-7497897b88-92lm7   MountVolume.SetUp failed for volume "nfs-client-root" : mount failed: exit status 32
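
For anyone hitting the same mount.nfs error, one quick check is to ask the server which NFS versions and transports it actually registers with rpcbind. The helper below is only a sketch: the function name `nfs_versions` is made up for illustration, and it assumes `rpcinfo` (from the rpcbind package) is available on the machine you run it from.

```shell
# List the NFS versions/transports a server registers with rpcbind.
# Reads `rpcinfo -p <server>` output on stdin and prints lines like "4/tcp".
# (Hypothetical helper name; not part of any charm or package.)
nfs_versions() {
    awk '$5 == "nfs" { print $2 "/" $3 }' | sort -u
}

# Usage, with the server address from the events above:
#   rpcinfo -p 192.168.54.78 | nfs_versions
```

If this prints nothing, no NFS service is registered at all, which suggests the server is not running rather than a genuine v3/v4 mismatch.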


#2

Actually I’m wondering if this is a bug in the version of the kubernetes-worker I used that’s included in charmed-kubernetes-270 for kube 1.15, as this seems to be an issue in the worker’s mount relation that is not present in latest…


#3

Tried removing NFS and re-adding it to the model with nfs options tcp,nfsvers=4 but that didn’t seem to change the problem on the worker side. Going to try this again by destroying the model and redeploying with these options the first time in case my changes just aren’t taking…


#4

Destroying and creating did not resolve the issue, and adding the mount options didn’t help either:

routhinator@andromeda:~$ kubectl get events
LAST SEEN   TYPE      REASON        OBJECT                                        MESSAGE
12m         Warning   FailedMount   pod/nfs-client-provisioner-7fbbc85766-spwpd   Unable to mount volumes for pod "nfs-client-provisioner-7fbbc85766-spwpd_default(6049df72-1131-4a8f-904e-71e01f86e556)": timeout expired waiting for volumes to attach or mount for pod "default"/"nfs-client-provisioner-7fbbc85766-spwpd". list of unmounted volumes=[nfs-client-root]. list of unattached volumes=[nfs-client-root default-token-qh2g5]
3m59s       Warning   FailedMount   pod/nfs-client-provisioner-7fbbc85766-spwpd   (combined from similar events): MountVolume.SetUp failed for volume "nfs-client-root" : mount failed: exit status 32
Mounting command: systemd-run
Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/6049df72-1131-4a8f-904e-71e01f86e556/volumes/kubernetes.io~nfs/nfs-client-root --scope -- mount -t nfs 192.168.54.189:/srv/data/kubernetes-worker /var/lib/kubelet/pods/6049df72-1131-4a8f-904e-71e01f86e556/volumes/kubernetes.io~nfs/nfs-client-root
Output: Running scope as unit: run-r34da98346d734fb29dc54d4ca55d8b03.scope
mount.nfs: requested NFS version or transport protocol is not supported


#5

Combing through the code on the kubernetes-worker, I see the deployment for the NFS provisioner doesn’t pass any mount options, even in the latest version… So I guess I’m barking up the wrong tree with the client mount options. Still really confused about this error. I’ll start digging into the code for the NFS charm and see if I can find anything. It certainly seems (based on the error) that the provisioner is requesting NFSv3 from an NFSv4 server…


#6

Doh, I’ve been barking up the wrong tree. There is a gotcha for LXD deployments that the Canonical docs should probably mention. It is described on the GitHub page for the NFS charm, but I missed it initially.

Oddly, before I wiped my 1.16 install and downgraded, I did not have to do this last time:

On the LXC host:

apt-get install nfs-common
modprobe nfsd
mount -t nfsd nfsd /proc/fs/nfsd

Edit /etc/apparmor.d/lxc/lxc-default and add the following four lines to it:

mount fstype=nfs,
mount fstype=nfs4,
mount fstype=nfsd,
mount fstype=rpc_pipefs,

after which:

sudo /etc/init.d/apparmor restart

Finally:

juju deploy nfs

I’m thinking this is what made this suddenly fail, though what has changed from last time I deployed to this, I cannot imagine.
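
To confirm whether the mount rules listed above are actually present in the profile, a read-only check along these lines may help. It is a sketch only: `missing_mount_rules` is a made-up name, and it simply looks for each rule at the start of a line in whatever file you point it at.

```shell
# Print each NFS-related mount rule that is NOT yet in the given AppArmor
# profile file. Read-only: it never modifies the file.
# (Hypothetical helper name, for illustration.)
missing_mount_rules() {
    profile="$1"
    for fstype in nfs nfs4 nfsd rpc_pipefs; do
        # Anchor at line start so "deny mount fstype=..." does not count.
        grep -Eq "^[[:space:]]*mount fstype=${fstype}," "$profile" \
            || echo "mount fstype=${fstype},"
    done
}

# Usage on the LXC host:
#   missing_mount_rules /etc/apparmor.d/lxc/lxc-default
```

No output means all four rules are there, so the problem lies elsewhere (for example in a different profile being applied to the container).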

However I can definitely see that the nfs server is failing to start in the LXC container:

root@juju-f9675b-5:/var/log/juju# journalctl -xe
-- 
-- Unit nfs-idmapd.service has failed.
-- 
-- The result is RESULT.
Oct 08 22:14:29 juju-f9675b-5 systemd[1]: nfs-idmapd.service: Job nfs-idmapd.service/start failed with result 'dependency'.
Oct 08 22:14:29 juju-f9675b-5 systemd[1]: Dependency failed for NFS Mount Daemon.
-- Subject: Unit nfs-mountd.service has failed
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
-- 
-- Unit nfs-mountd.service has failed.
-- 
-- The result is RESULT.
Oct 08 22:14:29 juju-f9675b-5 systemd[1]: nfs-mountd.service: Job nfs-mountd.service/start failed with result 'dependency'.
Oct 08 22:14:29 juju-f9675b-5 systemd[1]: nfs-server.service: Job nfs-server.service/start failed with result 'dependency'.
Oct 08 22:14:29 juju-f9675b-5 mount[542]: mount: /run/rpc_pipefs: permission denied.
Oct 08 22:14:29 juju-f9675b-5 systemd[1]: Started Preprocess NFS configuration.
-- Subject: Unit nfs-config.service has finished start-up
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
-- 
-- Unit nfs-config.service has finished starting up.
-- 
-- The start-up result is RESULT.
Oct 08 22:14:29 juju-f9675b-5 systemd[1]: run-rpc_pipefs.mount: Mount process exited, code=exited status=32
Oct 08 22:14:29 juju-f9675b-5 systemd[1]: run-rpc_pipefs.mount: Failed with result 'exit-code'.
Oct 08 22:14:29 juju-f9675b-5 systemd[1]: Failed to mount RPC Pipe File System.
-- Subject: Unit run-rpc_pipefs.mount has failed
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
-- 
-- Unit run-rpc_pipefs.mount has failed.
-- 
-- The result is RESULT.
Oct 08 22:14:29 juju-f9675b-5 systemd[1]: Dependency failed for RPC security service for NFS client and server.
-- Subject: Unit rpc-gssd.service has failed
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
-- 
-- Unit rpc-gssd.service has failed.
-- 
-- The result is RESULT.
Oct 08 22:14:29 juju-f9675b-5 systemd[1]: rpc-gssd.service: Job rpc-gssd.service/start failed with result 'dependency'.
Oct 08 22:14:29 juju-f9675b-5 systemd[1]: Dependency failed for RPC security service for NFS server.
-- Subject: Unit rpc-svcgssd.service has failed
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
-- 
-- Unit rpc-svcgssd.service has failed.
-- 
-- The result is RESULT.
Oct 08 22:14:29 juju-f9675b-5 systemd[1]: rpc-svcgssd.service: Job rpc-svcgssd.service/start failed with result 'dependency'.

I didn’t think to look at this at all as the Juju interface reported that all was well with the NFS unit… I guess it doesn’t actually check that the nfs service is operational.
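
Since the Juju status alone doesn't reveal this, a direct check of the systemd units inside the NFS unit is one way to catch it early. The following is a sketch under assumptions: the unit name `nfs/0` is a guess at what a default deployment would be called, and `check_nfs_services` is a made-up helper, not something the charm provides.

```shell
# Ask systemd inside a Juju unit whether the NFS services are active.
# Returns non-zero if any of them is not "active".
# (Hypothetical helper; the service names are taken from the journal above.)
check_nfs_services() {
    unit="$1"
    rc=0
    for svc in nfs-server nfs-mountd nfs-idmapd; do
        # juju ssh may append carriage returns, so strip them.
        state=$(juju ssh "$unit" "systemctl is-active $svc" 2>/dev/null | tr -d '\r')
        echo "$svc: ${state:-unknown}"
        [ "$state" = "active" ] || rc=1
    done
    return "$rc"
}

# Usage (unit name is an assumption; check `juju status` for yours):
#   check_nfs_services nfs/0
```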


#7

Oddly, even with all the advice applied, the mount permissions added to all variants of the LXD AppArmor profiles, and after restarting the containers and the whole server… I cannot get past the failure to mount the RPC pipe file system.


#8

OK, this looks like something that happened on my previous deployment but didn’t happen this time.

I managed to get the NFS server running and resolve the issue, but I had to do the following config modifications to the NFS container:

lxc config set juju-741e57-5 raw.apparmor="mount fstype=rpc_pipefs, mount fstype=nfsd,"
lxc config set juju-741e57-5 security.privileged true

Don’t see this mentioned anywhere, and the first command should have been covered in the apparmor profile. Is this a new issue?
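
To double-check that both settings really landed on the container, they can be read back with `lxc config get`. This is just a sketch with a made-up function name. As far as I know these keys are only applied when the container starts, so an `lxc restart` is probably needed afterwards, but treat that as an assumption.

```shell
# Print the two per-container overrides used above, as LXD currently has them.
# (Hypothetical helper name; uses only `lxc config get`.)
show_container_overrides() {
    container="$1"
    for key in raw.apparmor security.privileged; do
        printf '%s = %s\n' "$key" "$(lxc config get "$container" "$key")"
    done
}

# Usage, with the container name from the commands above:
#   show_container_overrides juju-741e57-5
#   lxc restart juju-741e57-5
```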


#9

Hey @routhinator, nice detective work (and persistence). It looks like we should get a LXD profile added to the nfs charm. As an example, here’s the one we added for kubernetes-worker in the 1.16 release.

If a lxd-profile.yaml exists in the root of the charm, Juju will apply it when deploying the charm to LXD - more info in the docs.

You could test it by cloning the nfs repo, adding a lxd-profile.yaml with the rules you need, charm build it, then juju deploy /path/to/local/nfs/charm.
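
As a starting point for that experiment, the two `lxc config set` overrides from post #8 could be written into a lxd-profile.yaml roughly like this. This is an untested sketch, not the profile that ships with any charm: the function name is made up, and whether Juju accepts `raw.apparmor` in a charm profile is an assumption you would need to verify against the Juju docs.

```shell
# Write a minimal lxd-profile.yaml into the root of a local charm checkout.
# The keys mirror the manual `lxc config set` commands from post #8.
# (Hypothetical helper; the profile contents are an untested assumption.)
write_nfs_lxd_profile() {
    charm_dir="$1"
    cat > "$charm_dir/lxd-profile.yaml" <<'EOF'
config:
  security.privileged: "true"
  raw.apparmor: |
    mount fstype=rpc_pipefs,
    mount fstype=nfsd,
EOF
}

# Usage, then build and deploy the local charm as described above:
#   write_nfs_lxd_profile /path/to/local/nfs/charm
```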


#10

As an example, here’s the one we added for kubernetes-worker in the 1.16 release.

Oh interesting, would this happen to be related to my new problem, which is not being able to get outgoing connections from pods?

# Gitlab runner:
ERROR: Registering runner... failed                 runner=a5sszn7s status=couldn't execute POST against https://gitlab.routh.io/api/v4/runners: Post https://gitlab.routh.io/api/v4/runners: dial tcp: i/o timeout
PANIC: Failed to register this runner. Perhaps you are having network problems 

#cert-manager issuers
3m20s       Warning   ErrVerifyACMEAccount   clusterissuer/letsencrypt-prod                            Failed to verify ACME account: Get https://acme-v02.api.letsencrypt.org/directory: dial tcp: i/o timeout
3m21s       Warning   ErrVerifyACMEAccount   clusterissuer/letsencrypt-staging                         Failed to verify ACME account: Get https://acme-staging-v02.api.letsencrypt.org/directory: dial tcp: i/o time

This wasn’t a problem when I deployed 1.16, but looking at the 1.15 commit for this, the profile file isn’t there. I’m guessing all of this is related to what the docs mention about using conjure-up with LXD, as it does extra configuration.

So I guess I need to add these perms to my Kube worker and master to fix my odd issues that remain.


#11

I doubt it. Even though you’re using 1.15, you’re still using the latest versions of the charms, which have the lxd profiles in them.


#12

@tvansteenburgh

Coming back to this thread after a couple weeks of reinstalling and testing… I am thinking this new method of installing is still missing some magic that conjure-up does to LXC to make kubernetes work.

To clarify, I still cannot get this to deploy with juju deploy on a fresh LXD cluster without broken networking or DNS… I’m still not familiar enough with Kubernetes to narrow down which.

This works when I do the following:

  • Wipe server and install LXD fresh
  • Deploy charmed-kubernetes with conjure-up
  • Remove conjure-up controller and model/containers
  • Run juju deploy after bootstrapping a fresh controller

If I do this, I get a working cluster.

When I:

  • Wipe server and install LXD fresh
  • Bootstrap a juju controller
  • Run juju deploy

The cluster says it’s up, it can pull and deploy workloads, but those workloads cannot connect to the internet or cluster services, and the services do not respond from the internet.

Any request to the internet from the pods results in: dial tcp: i/o timeout

I was digging and digging on what to do here and found this old Github issue - https://github.com/charmed-kubernetes/bundle/issues/286 - So I ran the test that @ktsakalozos asked the OP to run, and this is what I get:

ansible@andromeda:~$ kubectl apply -f https://k8s.io/examples/application/shell-demo.yaml
pod/shell-demo created
ansible@andromeda:~$ kubectl exec -it shell-demo -- /bin/bash
error: unable to upgrade connection: container not found ("nginx")
ansible@andromeda:~$ kubectl exec -it shell-demo -- /bin/bash
error: unable to upgrade connection: container not found ("nginx")
ansible@andromeda:~$ kubectl exec -it shell-demo -- /bin/bash
error: unable to upgrade connection: container not found ("nginx")
ansible@andromeda:~$ kubectl get pod shell-demo
NAME         READY   STATUS    RESTARTS   AGE
shell-demo   1/1     Running   0          28s
ansible@andromeda:~$ kubectl exec -it shell-demo -- /bin/bash
root@juju-992433-4:/# 
root@juju-992433-4:/# getent hosts default-http-backend
root@juju-992433-4:/#

So for some reason the default-http-backend lookup comes up empty when things are deployed with pure juju, but this problem does not exist when the LXD cluster is prepared by conjure-up first… something is off but I’m not sure where else to look.
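
One way to narrow down "DNS or networking" is to repeat the getent test for several names at once, including one that exists in every cluster. The helper below is a sketch with a made-up name. It assumes the shell-demo pod from the test above is still running and that its image provides `getent`.

```shell
# Resolve each given name from inside a pod and report ok/FAIL per name.
# (Hypothetical helper; relies only on `kubectl exec` and `getent hosts`.)
pod_dns_check() {
    pod="$1"
    shift
    for name in "$@"; do
        if kubectl exec "$pod" -- getent hosts "$name" >/dev/null 2>&1; then
            echo "ok   $name"
        else
            echo "FAIL $name"
        fi
    done
}

# Usage:
#   pod_dns_check shell-demo kubernetes.default default-http-backend acme-v02.api.letsencrypt.org
```

Short service names only resolve within the pod's own namespace, so if default-http-backend lives in a different namespace an empty result would be expected even on a healthy cluster. `kubernetes.default` is the safer probe for cluster DNS, and an external name separates DNS failures from outbound connectivity failures.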