Posting this here before filing a bug, as I want to rule out user error before squawking any louder.
We have been battling ceph-fs units getting stuck in a 'blocked' workload state with the status "No MDS detected using current configuration". See Ubuntu Pastebin.
Looking at `ceph -w`, I can see that one unit is active, and I am able to create and use CephFS pools. The other two units, stuck in the blocked state, don't show up at all:
```
$ sudo ceph osd pool create cephfs_data 200
pool 'cephfs_data' created
ubuntu@ip-172-31-104-15:~$ sudo ceph osd pool create cephfs_metadata 200
pool 'cephfs_metadata' created
ubuntu@ip-172-31-104-15:~$ sudo ceph fs new cephfs cephfs_metadata cephfs_data
new fs with metadata pool 4 and data pool 3
ubuntu@ip-172-31-104-15:~$ sudo ceph -w
  cluster:
    id:     4f943010-3aac-11e9-89c5-0aeab1a19b3c
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum ip-172-31-102-40,ip-172-31-104-15,ip-172-31-104-163
    mgr: ip-172-31-104-15(active), standbys: ip-172-31-102-40, ip-172-31-104-163
    mds: cephfs-1/1/1 up {0=ip-172-31-102-75=up:active}
    osd: 9 osds: 9 up, 9 in

  data:
    pools:   4 pools, 464 pgs
    objects: 21 objects, 2.2 KiB
    usage:   1.1 GiB used, 449 GiB / 450 GiB avail
    pgs:     464 active+clean

  io:
    client: 2.6 KiB/s wr, 0 op/s rd, 5 op/s wr
```
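In case it helps anyone else debugging along, this is roughly what we have been checking to confirm whether the MDS daemon on a blocked unit ever starts or registers with the cluster. This is just a sketch; the unit name (`ceph-fs/1`) and hostnames are placeholders from our environment.

```shell
# On a blocked ceph-fs unit: is the MDS daemon even running locally?
# (the systemd instance name is the short hostname in our deploy)
sudo systemctl status "ceph-mds@$(hostname -s)"
sudo journalctl -u "ceph-mds@$(hostname -s)" -n 50

# From a monitor node: does the cluster see the daemon at all?
sudo ceph mds stat
sudo ceph fs status cephfs

# And the charm's view of things (ceph-fs/1 is a placeholder unit name):
juju status ceph-fs
juju debug-log --include unit-ceph-fs-1
```

On the blocked units the daemon appears locally but never shows up in `ceph mds stat`, which is what the charm's "No MDS detected" message seems to be keying off.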
Here is the log from a ceph-fs unit stuck in blocked: Ubuntu Pastebin
Here is the log from an active ceph-fs unit: Ubuntu Pastebin
Sometimes we can spin up this same deployment and have all ceph-fs units come up and join the cluster as expected.
We have tested this on LXD, AWS, and bare metal. In the bare-metal deployment, our ceph-fs units use an MTU of 9000 and still exhibit the issue.
Here is a bundle similar to the one we are using, which can be used to reproduce the issue: Ubuntu Pastebin
@rr-pdl has been debugging this issue pretty aggressively for the last week; he may have some more input here.
Thoughts? @chris.macnaughton ^?