Ceph-fs Charm getting stuck

seffyroff · 15 October 2018 21:23

Hi there! After several early, aborted attempts at casually using Juju for deployment, where debugging got complicated quickly, I have dug in my heels and gotten further (actually successful deployments of some charms for the first time!), but I am snagging on a couple of Ceph cluster components.

My setup:
Juju Controller deployed in container on my desktop on the same LAN as the target machines. Version 2.5-beta1-bionic-amd64 installed from snap --edge channel.
Manual cloud with 4 metal nodes running bionic manually added.

Ceph-mon and Ceph-osd deployed to the cloud mostly accepting defaults, relation added between Mon and OSD.

So far so good. Mon and OSD cluster are up and running!

Status looks like this:
Every 2.0s: juju status --color juju-metal: Mon Oct 15 14:09:12 2018

Model    Controller  Cloud/Region  Version    SLA          Timestamp
default  metal-ctrl  hl-metal      2.5-beta1  unsupported  14:09:12-07:00

App       Version  Status   Scale  Charm     Store       Rev  OS      Charm version  Notes
ceph-fs            waiting      0  ceph-fs   jujucharms   16  ubuntu    
ceph-mon  12.2.7   active       3  ceph-mon  jujucharms   27  ubuntu    
ceph-osd  12.2.7   active       4  ceph-osd  jujucharms  270  ubuntu    

Unit         Workload  Agent  Machine  Public address  Ports  Message
ceph-mon/0   active    idle   0        celery                 Unit is ready and clustered
ceph-mon/1*  active    idle   1        lazarus                Unit is ready and clustered
ceph-mon/2   active    idle   2        inspiral               Unit is ready and clustered
ceph-osd/4   active    idle   2        inspiral               Unit is ready (1 OSD)
ceph-osd/5   active    idle   4        rombus                 Unit is ready (1 OSD)
ceph-osd/6*  active    idle   1        lazarus                Unit is ready (1 OSD)
ceph-osd/7   active    idle   0        celery                 Unit is ready (1 OSD)

Machine  State    DNS       Inst id          Series  AZ  Message
0        started  celery    manual:celery    bionic      Manually provisioned machine
1        started  lazarus   manual:lazarus   bionic      Manually provisioned machine
2        started  inspiral  manual:inspiral  bionic      Manually provisioned machine
4        started  rombus    manual:rombus    bionic      Manually provisioned machine

So I moved on to installing Ceph-fs: adding the application, then relating Mon to the Ceph-fs app, then starting with 1 unit. The deployment gets stuck with the status “Installing btrfs-tools,ceph,ceph-mds,gdisk,ntp,python-ceph,python3-pyxattr,xfsprogs” and never advances. I’ll post the complete debug-log, but the most interesting-looking output seems to be the mon complaining about the ceph-fs node:
unit-ceph-mon-1: 14:15:46 DEBUG unit.ceph-mon/1.mds-relation-changed Error EINVAL: pool 'ceph-fs_data' (id '1') has a non-CephFS application enabled.
I’m combing through the Ceph docs right now to see if there’s some Mon or OSD flag I need set that currently isn’t, but any pointers folks have to help get to the bottom of this would be appreciated.
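
For anyone hitting the same error, here is a minimal sketch of how the pool named in that log line could be inspected. It assumes the message refers to the pool application tags that Ceph introduced in Luminous; this thread does not confirm that reading. The `run` helper and the `DRY_RUN` variable are illustrative only: by default the script just prints the commands, and both `ceph` subcommands are read-only.

```shell
#!/bin/sh
# Hypothetical helper: print commands by default; set DRY_RUN=0 to execute them.
run() { if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi; }

# The error line from the debug-log, quoted verbatim.
log_line="unit-ceph-mon-1: 14:15:46 DEBUG unit.ceph-mon/1.mds-relation-changed Error EINVAL: pool 'ceph-fs_data' (id '1') has a non-CephFS application enabled."

# Extract the pool name from between the single quotes that follow "pool".
pool=$(printf '%s\n' "$log_line" | sed -n "s/.*pool '\([^']*\)'.*/\1/p")
echo "pool: $pool"

# Show which application tag(s) the pool currently carries.
run juju ssh ceph-mon/1 sudo ceph osd pool application get "$pool"

# List all pools with their details, including application tags.
run juju ssh ceph-mon/1 sudo ceph osd pool ls detail
```

If the first command reports something other than `cephfs` for the pool, that would be consistent with the EINVAL above. Whether the stable charm revision is what sets that tag is an assumption to verify, not something this sketch shows.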

My second issue is that when I deploy Ceph with MDS I normally include a Ceph-mgr service, for which there seems to be no charm. Any insight here would be welcome.

EDIT: Ceph-fs deploy log is here.

seffyroff · 15 October 2018 21:50

Actually, disregard my second issue. Upon investigation, I seem to have mgr services already running in the cluster; I assume they were added as part of the mon charm.

seffyroff · 7 November 2018 06:24

I returned to this issue today and repro’d it in a fresh environment using Mimic sources instead of Luminous. The same problem happened again. I then replaced the default cs:ceph-fs-16 with cs:~openstack-charmers-next/ceph-fs-34 in my model and managed to bring the MDS up successfully. I guess something is broken in the stable version with respect to Luminous/Mimic that the newer, unreleased version has a fix for?
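
One way such a swap can be done in place is sketched below. It assumes the application is named `ceph-fs` in the model and that the charm is switched with `juju upgrade-charm --switch`. Removing the application and redeploying from the other charm URL is an equally plausible route, and the post does not say which was used. The `run` helper is hypothetical and only prints the command unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Hypothetical helper: print commands by default; set DRY_RUN=0 to execute them.
run() { if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi; }

app="ceph-fs"                                   # assumed application name
charm="cs:~openstack-charmers-next/ceph-fs-34"  # charm URL quoted in the post

# Point the existing application at the other charm URL.
run juju upgrade-charm "$app" --switch "$charm"

# Follow progress afterwards.
run juju status "$app"
```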

chris.macnaughton · 7 November 2018 13:06

I’m glad the version in openstack-charmers-next is working for you, as that’s the version we’re working on releasing within the next few days!

An additional observation: your monitors should be placed into containers on the ceph-osd machines. Otherwise the two charms will both try to own the Ceph configuration file and will trample each other. The supported deployment bundle would look something like:

machines:
  0:
    series: bionic
  1:
    series: bionic
  2:
    series: bionic
  3:
    series: bionic
applications:
  ceph-osd:
    charm: cs:~openstack-charmers-next/ceph-osd
    num_units: 4
    options:
      osd-devices: /dev/sdb
    to:
      - 0
      - 1
      - 2
      - 3
  ceph-mon:
    charm: cs:~openstack-charmers-next/ceph-mon
    num_units: 3
    options:
      expected-osd-count: 4
    to:
      - lxd:1
      - lxd:2
      - lxd:3

That way, the systems are isolated from one another but can still share the physical hosts!
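
A rough sketch of how a bundle like the one above might be deployed follows. The file name `ceph-bundle.yaml` is an assumption. Because the bundle as shown has no `relations:` section, the relations are added by hand here, and the ceph-fs steps are only one possible follow-up. The `run` helper is hypothetical and prints each command unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Hypothetical helper: print commands by default; set DRY_RUN=0 to execute them.
run() { if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi; }

bundle="./ceph-bundle.yaml"   # assumed local file holding the bundle above

# Deploy ceph-osd on the hosts and ceph-mon into LXD containers on them.
run juju deploy "$bundle"

# The bundle as shown defines no relations, so relate the two applications.
run juju add-relation ceph-osd ceph-mon

# Possible follow-up: add the MDS from the same charm namespace and relate it.
run juju deploy cs:~openstack-charmers-next/ceph-fs
run juju add-relation ceph-fs ceph-mon
```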

seffyroff · 8 November 2018 18:38

Hi Chris, thanks for taking the time to reply! The example reference config is actually what I’d moved towards in my second iteration of this deployment, so the clarification is reassurance that I’m not Doing It Wrong™.