[SOLVED] Recovering Ceph Cluster from Controller Loss


#1

Hey folks!

Inclement weather destroyed a MAAS Juju controller that I’d been using for Ceph deployment.

The Ceph OSDs and Mons are still running happily, but the MDS is crashing, complaining it can’t find any mons. I’ll rebuild the whole thing, but first I want to access the CephFS to pull down the files stored there.

I appreciate this is more of a Ceph issue than a Juju one, so I might go bother the OpenStack folks, but I’d welcome any push in a specific direction. I know @chris.macnaughton regularly reads here?


#2

It seems like the MDS is unable to find the Mons by hostname; I assume that’s because the controller was providing DNS, and it’s gone. Adding mon IPs to /etc/hosts didn’t help here, though…
On the MDS container, /var/lib/ceph.log looks like this:

```text
2019-02-25 00:53:28.964 7f26e6e40700 -1 failed for service _ceph-mon._tcp
2019-02-25 00:53:28.964 7f26e6e40700 -1 monclient: get_monmap_and_config cannot identify monitors to contact
2019-03-01 00:23:32.188 7fd74de23700 -1 failed for service _ceph-mon._tcp
2019-03-01 00:23:32.188 7fd74de23700 -1 monclient: get_monmap_and_config cannot identify monitors to contact
2019-03-01 00:24:05.448 7f5a499b9700 -1 failed for service _ceph-mon._tcp
2019-03-01 00:24:05.448 7f5a499b9700 -1 monclient: get_monmap_and_config cannot identify monitors to contact
2019-03-01 00:25:31.180 7f6980486700 -1 failed for service _ceph-mon._tcp
2019-03-01 00:25:31.180 7f6980486700 -1 monclient: get_monmap_and_config cannot identify monitors to contact
2019-03-01 00:25:41.764 7fb6e8ffa700 -1 failed for service _ceph-mon._tcp
2019-03-01 00:25:52.328 7fd8e9e4a700 -1 failed for service _ceph-mon._tcp
2019-03-01 00:26:03.205 7f1f19f36700 -1 failed for service _ceph-mon._tcp
2019-03-01 00:26:06.525 7fb2087e8700 -1 failed for service _ceph-mon._tcp
2019-03-04 11:09:47.345 7f96e096d700 -1 failed for service _ceph-mon._tcp
2019-03-04 11:09:47.345 7f96e096d700 -1 monclient: get_monmap_and_config cannot identify monitors to contact
2019-03-04 20:42:28.233 7f3d89c87700 -1 failed for service _ceph-mon._tcp
2019-03-04 20:42:28.233 7f3d89c87700 -1 monclient: get_monmap_and_config cannot identify monitors to contact
2019-03-04 20:43:09.861 7f23b6609700 -1 failed for service _ceph-mon._tcp
2019-03-04 20:43:09.861 7f23b6609700 -1 monclient: get_monmap_and_config cannot identify monitors to contact
2019-03-04 20:59:12.986 7fe74f7c1700 -1 failed for service _ceph-mon._tcp
2019-03-04 20:59:12.986 7fe74f7c1700 -1 monclient: get_monmap_and_config cannot identify monitors to contact
```
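For anyone hitting the same `failed for service _ceph-mon._tcp` error: Ceph clients only fall back to a DNS SRV lookup when no `mon host` is set in their config, which would explain why /etc/hosts entries alone didn’t help. A minimal sketch of a workaround (the fsid and mon IPs below are placeholders, substitute your cluster’s values):

```ini
# /etc/ceph/ceph.conf on the MDS node
# fsid and mon IPs are placeholders -- use your own cluster's values
[global]
fsid = 00000000-0000-0000-0000-000000000000
mon host = 10.0.0.11,10.0.0.12,10.0.0.13
```

With `mon host` present, the monclient contacts the listed IPs directly and never attempts the SRV record lookup.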

#3

I tried spinning up a new MDS inside one of the Mon containers, and I managed to get the MDS started using the existing keyrings in the Mon container. I followed the Ceph docs for adding an MDS to accomplish this.
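For reference, the manual MDS bring-up from the Ceph docs looks roughly like this (the daemon name `mds0` and the paths are illustrative, not taken from this cluster):

```shell
# Run on the mon host; "mds0" is a placeholder daemon name.
mkdir -p /var/lib/ceph/mds/ceph-mds0

# Create an MDS key using the cluster's existing admin keyring
ceph auth get-or-create mds.mds0 \
    mon 'allow profile mds' osd 'allow rwx' mds 'allow' \
    -o /var/lib/ceph/mds/ceph-mds0/keyring

# Start the daemon
systemctl start ceph-mds@mds0
```

This works here because the mon container already holds a usable admin keyring, so no surviving MDS state is needed.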

The MDS is now started; however, if I try to mount the FS I get:

mount error 1 = Operation not permitted

EDIT: Solved it through some mount syntax mumbo jumbo
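In case the “mumbo jumbo” helps someone else: a kernel CephFS mount typically needs an explicit client name and secret on top of the mon address, otherwise the mount is rejected with `Operation not permitted`. A sketch with placeholder values:

```shell
# 10.0.0.11 is a placeholder mon IP; the key comes from the admin keyring
mount -t ceph 10.0.0.11:6789:/ /mnt/cephfs \
    -o name=admin,secret=AQDplaceholderkey==

# or point at a file holding just the secret instead of passing it inline:
# mount -t ceph 10.0.0.11:6789:/ /mnt/cephfs \
#     -o name=admin,secretfile=/etc/ceph/admin.secret
```

The secret is the base64 `key =` value from the client’s keyring (e.g. `/etc/ceph/ceph.client.admin.keyring`).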


#4

This reminds me that we (PDL) need to get an export/backup strategy in place for our controllers/juju-db.
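Juju has built-in controller backups that would cover exactly this scenario (a sketch; subcommand names and flags vary a bit between Juju versions):

```shell
# Snapshot the controller's state, including juju-db
juju create-backup --filename controller-backup.tar.gz

# Later, restore it into a freshly bootstrapped controller
juju restore-backup --file controller-backup.tar.gz
```

Running the backup on a schedule and shipping the archive off the controller host would have made this whole recovery unnecessary.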


#5

I’m glad you got it working! I suspect you’re exactly right that losing your MAAS lost you DNS (and thus a lot of access to Ceph).