Help unpicking my MAAS Mess


#1

Hey folks!

TL;DR: I think I’ve got a race condition where unitdata is being populated before the service has started and made the data available. What’s the right way to prevent this? I’ve tried setting flags once the service is started, and I’m looking for a function to test for the availability of a file.

A series of unfortunate broken external dependencies led me to hack @jamesbeedy’s excellent MAAS charm so that it can install via apt rather than snap. Of course I made rather a mess of this, as this is one of my early forays into these depths of implementation, and I’m overstepping my rather limited Python boundaries by a significant margin.

I branched my fork to separate the layer into 2 charms, for rack and region, as I thought this may make things easier to debug. They’re here (rack) and here (region).

The deploy itself goes like this:

  1. Deploy Postgres to new model
  2. Deploy layer-maas-region-lxd
  3. Relate Postgres to Maas-region
  4. Deploy layer-maas-rack-lxd
  5. Relate region to rack
  6. ???
  7. Profit!

Right now it’s failing at the end of step 3, as the charm attempts to add the MAAS secret to unitdata.kv. It’s looking for a file that MAAS creates, and it’s not finding it. If I look for the file on the unit (/var/lib/maas/secret) it’s there, so I assume either it isn’t there yet when the hook that populates unitdata.kv runs, or there’s some permissions issue?

I’ve had a go using debug-hooks, chlp, charms.reactive.sh and manual hook triggering, but I feel a bit lost to be honest. I am however not looking for someone to fix this for me, I’d much rather figure it out myself - but after several days poking at this it feels like I need to gain some outside perspective. I would like to document this process as an attempt to give insight into debugging and probably some other best practices I need to adopt, but I’m not there yet.


#2

“looking for a file that MAAS creates”.

Can you clarify which charm is failing? Is it MAAS-region that is looking for the file, or Postgres?

If MAAS-region is looking for a file that MAAS creates, do you have an idea when it is created? Is it created in response to the relation? Or is it just done as part of some stage of initialization, and the charm hook is simply racing with MAAS finishing the setup of the machine?

Assuming it is the charm hook running on the machine, it is unlikely to be a permission issue, as hooks default to running as root. (I doubt the charm hook would be calling setuid to drop power.)

I have the feeling that the “look for the file” needs to be changed to some sort of poll / look for some other event that signals the file should be there by now.
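A minimal sketch of the polling idea, using only the Python standard library. The function names `wait_for_file` and `read_secret` are made up for illustration and are not part of charmhelpers or charms.reactive; the path is the one mentioned in the first post, and the timeout values are arbitrary.

```python
import os
import time


def wait_for_file(path, timeout=60.0, interval=1.0):
    """Poll until `path` exists and is non-empty, or `timeout` seconds pass.

    Returns True if the file showed up in time, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Require a non-empty regular file so a half-written secret
            # is not picked up.
            if os.path.isfile(path) and os.stat(path).st_size > 0:
                return True
        except OSError:
            # The file vanished between the two checks; keep polling.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def read_secret(path="/var/lib/maas/secret", timeout=60.0):
    """Return the stripped file contents, or None if the file never appeared."""
    if not wait_for_file(path, timeout=timeout):
        return None
    with open(path) as f:
        return f.read().strip()
```

If `read_secret()` returns None, one option is to return from the handler without setting the flag that marks the secret as stored, so the same handler gets another chance on a later hook. Sleeping for a long time inside a hook holds up the unit, so a short timeout combined with that retry is probably the safer combination.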


#3

I finally unpicked this myself, mostly. I now have a working MAAS 2.5 Region and Rack Controller bundle that relates to postgres and gets everything initialized.

Some remaining work:

  1. I stripped out all the leadership stuff, so there’s no scale-out support right now. MAAS 2.5 has some great HA stuff built in, but I need to do some leadership stuff to allow this to happen without DBs getting trampled. It looks like I can use the leadership layer then add in a few reactive @when events for initializing a non-leader.
  2. I can’t find a reliable way to add the installed version number to status. I guess I can probably get this from apt? I tried "application_version_set(get_upstream_version('maas-region-controller'))" and that didn’t work. But I’m almost certainly using it wrong.
  3. Most of the configs aren’t hooked up, and there’s no handling of post-install config changing. It seems like a lot of work!
  4. The order of the configs in the GUI doesn’t match the config.yaml. So it’s asking for admin password before admin username :confused:
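For item 2, one way to get the installed version from apt’s side is to ask dpkg directly and trim the result. This is only a sketch: `upstream_version` and `installed_upstream_version` are hypothetical helper names, and it does not explain why the original `get_upstream_version` call failed. It assumes `dpkg-query` is on the PATH, which should hold on any Ubuntu unit.

```python
import subprocess


def upstream_version(debian_version):
    """Strip the epoch and Debian revision from a full package version."""
    # Debian versions have the form [epoch:]upstream[-revision].
    if ":" in debian_version:
        debian_version = debian_version.split(":", 1)[1]
    if "-" in debian_version:
        debian_version = debian_version.rsplit("-", 1)[0]
    return debian_version


def installed_upstream_version(package):
    """Return the upstream version of an installed package, or None."""
    result = subprocess.run(
        ["dpkg-query", "-W", "-f=${Version}", package],
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        universal_newlines=True,
    )
    version = result.stdout.strip()
    # A non-zero exit or an empty version means the package is not installed.
    if result.returncode != 0 or not version:
        return None
    return upstream_version(version)
```

The result of `installed_upstream_version('maas-region-controller')` could then be passed to `application_version_set` from `charmhelpers.core.hookenv`, skipping the call when it is None (for example, because the hook ran before the package finished installing).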

#4

I accounted for this in my revs of the maas charms; look back to see how I gated against non-leaders initializing the db.

This looks right. Possibly some simple debugging/print statements would get you there.

True. There should be a disclaimer :slight_smile: The real value for me in all of this is having WALE running my db backups to AWS and having the replication and ease of ops using the PostgreSQL charm. The amount of ops needed to juju deploy a machine in support of scaling out maas is so minimal and needed so infrequently … I just felt creating some minimal automation to just stand the bits up and keep them going in conjunction with the postgres stuff was most valuable to me.

huh?


#5

I spent some more time on this today, and fixed a bunch of stuff. I’m actually starting to get more comfortable with debug-hooks and knowing my way around stuff. I added back in a subset of @jamesbeedy’s leadership logic, limited to rack or region only. It’s working quite well. Version is now set correctly too.

As for the GUI config order, I assume the options are sorted alphabetically prior to FE rendering. Code vs GUI looks like this for me:

I would like the deployment status to be a bit more detailed as to what’s going on, but I fell into lots of awkward logic traps when I tried to add that sort of stuff, so I’ll revisit that when I’m feeling more adventurous.

Anyway, it’s up on GitHub, and now I’ll be rigging it into my ZeroTier-backed cross-cloud LXD cluster. Hopefully!