Use of leadership to scale relation data sharing

aluria · 21 October 2019 08:03

Hi,

I believe there have been questions in the past about the best use of leader_set/leader_get, and the documentation (Implementing leadership and Leadership howtos) is also very nice to read. However, I’m looking to better understand how Juju hooks work and to think about scalability.

The end goal I want to achieve is:

(1) principal_app:exporter-endpoint <-> remote_subordinate_app
(2) physical_node:juju-info <-> remote_subordinate_app:juju-info (places the subordinate in all the physical nodes of an environment)
(3) remote_subordinate_app:peer-endpoint

Currently, in step (3), the subordinate app shares peer information that each unit uses to build the data it shares with “principal_app” in step (1). “principal_app” then renders all the shared config into a single configuration file (so the file needs to be rebuilt on every change).

The procedure above will:
1.a) trigger “peer-endpoint-relation-joined” every time new peers are added
1.b) trigger “peer-endpoint-relation-changed” because the new peer units will call relation_set to share their details
1.c) trigger “exporter-endpoint-relation-changed” for every subordinate unit, since each one shares data built from its relation with the other peer units.

2.a) trigger “peer-endpoint-relation-departed” every time peers are removed, which will also trigger “exporter-endpoint-relation-changed” because the subordinate units will update their shared data.
2.b) trigger “exporter-endpoint-relation-departed” for each removed peer

It seems like a better approach would be for the “principal_app” to receive a single “exporter-endpoint-relation-changed” instead of N of them, where N is the number of physical nodes. To achieve this, the leader of the remote_subordinate_app should be the only one sending the data to the “principal_app”, while peer subordinates should share their details via leader_set (which would trigger the “leader-settings-changed” hook).

A question I have is whether the “-relation-changed” hook identifies which unit triggered it. Right now, when that hook is triggered, all the related units are parsed. It would be less expensive to run relation_get only against the unit that triggered it.

On the other hand, if “leader_set” is used, I think it would be less expensive to also manage local data (via the unitdata db) to limit calls to “leader_get”. Otherwise, let’s say 200x nodes could make the leader unit become a bottleneck with 199 “leader_get” calls.

What do you think about it? In case it helps, the use case above is prometheus using the http interface to receive data from the “prometheus-blackbox-exporter” service running on each compute-storage node of a cloud. The blackbox exporter uses the “icmp” module to probe that all its peers are reachable (and all its peers do the same, so any config change needs to rebuild the 1-N relation for pings). Note that each node may have a different networking setup, so each node shares all its available networks with its peers, and they decide which ones can be used to test pings. Some may configure 2-3 different ping probes, others may only be able to configure one; in other words, the 1-N relation needs to be built by each peer subordinate unit.

FWIW, a disadvantage in Prometheus is that only a single prometheus.yml file can be configured for scrape_configs. A different use case is Nagios and the nrpe-charm subordinates. The current nagios-charm builds every config into a single charm.cfg file, but Nagios allows multiple config files, so each nrpe unit could create its own file (fewer disk writes, as charmhelpers’ hosts.write_file checks the content before it writes). In this Nagios case, it would help a lot if the triggered -relation-changed hook identified the unit that triggered it.
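
To make the Nagios idea concrete, here is a minimal sketch of "one config file per unit, written only when the content changes". It uses only the Python standard library as a stand-in for the charmhelpers helper, and the file naming scheme (`nrpe/3` becomes `nrpe-3.cfg`) is an assumption for illustration, not what nagios-charm does today.

```python
import os
import tempfile


def write_if_changed(path, content):
    """Write content to path only when it differs; return True if written."""
    try:
        with open(path, "r", encoding="utf-8") as fh:
            if fh.read() == content:
                return False  # identical content: skip the disk write
    except FileNotFoundError:
        pass  # first time this file is rendered
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(content)
    return True


def unit_config_path(config_dir, unit_name):
    # Hypothetical naming scheme: "nrpe/3" -> "<config_dir>/nrpe-3.cfg"
    return os.path.join(config_dir, unit_name.replace("/", "-") + ".cfg")


def update_unit_config(config_dir, unit_name, content):
    """Render the file for a single unit, touching nothing else."""
    return write_if_changed(unit_config_path(config_dir, unit_name), content)


if __name__ == "__main__":
    demo_dir = tempfile.mkdtemp()
    print(update_unit_config(demo_dir, "nrpe/3", "define host { }\n"))
    print(update_unit_config(demo_dir, "nrpe/3", "define host { }\n"))
```

With a layout like this, a change from one unit only rewrites that unit's file, which is why knowing the triggering unit matters.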

Sorry for the long post, and let me know if I have left out anything needed to understand the use cases. The concern is what happens when 200 subordinate peers exist, and which approach takes less time (and fewer resources) to process: 1) all subordinate changes 2) a single subordinate change. For (1), it seems leadership would be best. For (2), I’m not so sure.

Thank you,
-Alvaro.

rick_h · 21 October 2019 11:47

Great post Alvaro. This is definitely something that’s been on our mind. In fact, @jameinel and @babbageclunk have been poking at this with an idea for a relation data bag that is app level vs unit level. There’s a PR with the user side of things here you can poke at and see:

https://github.com/juju/juju/commit/dfdfb44a5cd1f953ac7783a7b473810cce360c1a

There’s more work to be done to make it useful that is in flight, but it might be worth peeking at and see if this addresses your concerns.

jameinel · 21 October 2019 12:38

I’m not sure I followed everything you were saying, but I can speak to the idea of how charms interrelate and what data must be sent vs what could be sent.

First, “leader_set” is only used to communicate from the leader unit to all of the peers. It cannot be used for the peers to communicate back to the leader. Currently it is intended for when you need only 1 unit to generate some information, that all other units will then consume.

I’ll also say that if a subordinate is “scope container” instead of “scope global”, then when related to its primary, it doesn’t see all the other units, only the one that is colocated on its machine. (eg, if you juju deploy -n10 ubuntu; juju deploy telegraf; juju relate ubuntu:juju-info telegraf:juju-info, then the ‘telegraf/0’ that is colocated with ‘ubuntu/0’ won’t see ubuntu/1-9.)

However, when telegraf is related to prometheus, obviously prometheus needs to see all of those telegraf units so it can scrape each of them.

The other bit is that the relation-changed hook does, indeed, give you the context of which unit’s data has changed (the env var $JUJU_REMOTE_UNIT). Many charms are written to reevaluate the state of the world when something has changed. (Consider prometheus, which needs an entry for each of the telegraf units. You could track that you’ve seen units 0-9, and when 10 shows up, just add one more entry for it. But likely you have to render the entire file, so you might as well just iterate 0-10 and build up the information you need anyway.)
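
As a rough sketch of the "only look at the unit that changed" approach: the hook reads JUJU_REMOTE_UNIT and calls the relation-get hook tool for that one unit. The `known` dict is a hypothetical cache of what has already been seen (a real charm would have to persist it between hooks), and the `run` and `environ` parameters exist only so the function can be exercised outside a hook context.

```python
import json
import os
import subprocess


def relation_get_unit(unit, run=subprocess.check_output):
    """Return the relation settings of a single remote unit as a dict."""
    # "-" asks relation-get for all keys; without -r the relation id
    # defaults to the relation of the hook currently executing.
    out = run(["relation-get", "--format=json", "-", unit])
    return json.loads(out) or {}


def on_relation_changed(known, environ=os.environ, run=subprocess.check_output):
    """Refresh only the unit that triggered the hook; return the full view."""
    unit = environ.get("JUJU_REMOTE_UNIT")
    if unit:
        known[unit] = relation_get_unit(unit, run=run)
    return known
```

The rendered file can still be rebuilt from `known` in full, but only one relation-get call is made per hook instead of one per related unit.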

As for hook efficiency, it is something we’ve been working on, as @rick_h mentioned.

At present, all relations are modeled as N:M (each N units of app A sees all the M units of app B and vice versa). You could imagine several different topologies Leader:M, Leader:Leader, etc.

One of the things I’m trying to get ready for the 2.7 cycle is a data bag in each relation that represents the application as a whole, rather than just a unit’s view of the relationship (relation-set --app, relation-get --app). To coordinate who would get to speak for the app as a whole, it would be restricted to just the current leader. (Imagine you’re setting a password for wordpress to connect to cassandra, you don’t want each unit to independently generate a different password.)

This allows efficient 1:M messaging across relations. The leader unit sets the data, all the units of the other side get to see that information. If the leader goes away, the new leader takes over with the same data bag. (You could have effectively simulated this if you did an ‘is_leader’ check, and only the unit that is the leader ever set the relation data. However, it gets clumsy if that unit goes away/leadership changes.)
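
For reference, the "simulated" variant mentioned in the parentheses above could look roughly like this. It is a sketch of the workaround, not of the application data bag: it calls the existing is-leader and relation-set hook tools, the relation id and settings are whatever the caller supplies, and the `run` parameter is only there so the logic can be tried outside a hook.

```python
import json
import subprocess


def is_leader(run=subprocess.check_output):
    # is-leader prints true/false; --format=json makes it easy to parse.
    return json.loads(run(["is-leader", "--format=json"])) is True


def publish_as_leader(settings, relation_id, run=subprocess.check_output):
    """Set relation data only when this unit is the leader.

    Returns True if relation-set was invoked, False otherwise.
    """
    if not is_leader(run=run):
        return False
    args = ["relation-set", "-r", relation_id]
    args += ["{}={}".format(k, v) for k, v in sorted(settings.items())]
    run(args)
    return True
```

The clumsy part is visible here: the data lives in the settings of whichever unit happened to be leader, so after a leadership change the new leader has to publish everything again from its own unit.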

We don’t yet have a story for efficient N:1 messaging. (Each unit needs to set information, but only the leader of the other app needs to respond to it.) We’ve considered it, just not convinced it is worth the overhead yet.

This is also significantly more of an issue on peer relations, since while many HA scenarios are ~3 units on one side and 100 units on the other, in a peer relation, you’re talking 100 units on both “sides” of the relation.
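
A back-of-the-envelope way to see the difference, under a deliberately simplified model: every unit sets its data once, and each change queues exactly one relation-changed hook on every unit that can see it. Joined/departed hooks and any coalescing are ignored, so the numbers are only indicative.

```python
def cross_relation_hooks(n, m):
    """N:M relation: each of the n units wakes m remote units, and vice versa."""
    return n * m + m * n


def peer_relation_hooks(n):
    """Peer relation: each of the n units wakes its n - 1 peers."""
    return n * (n - 1)


if __name__ == "__main__":
    print(cross_relation_hooks(3, 100))  # 3 units on one side, 100 on the other
    print(peer_relation_hooks(100))      # 100 peers
    print(peer_relation_hooks(200))      # the 200-node case from the question
```

In this model the 3:100 relation produces 600 hook executions, while a 100-unit peer relation produces 9900 and a 200-unit one 39800.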

FWIW, if all goes to plan, we’ll be deprecating “leader-set” in 2.7, as it is equivalent to “relation-set --app” on a peer relation, except with a lot more caveats (it can’t work on non-peer relations, it has different behavior wrt atomic updates and hook failures, etc). “Something else to learn” rather than it being just part of the standard way of communicating.

You can imagine something like “relation-set --for-leader-only”, which could even be combined with “relation-set --app --for-leader-only”. (as the leader of this app, let me communicate information that only needs to be handled by the leader of the other app.)
This would also be a lot more efficient in a leader<->peers relationship, since each peer doesn’t wake up to realize it doesn’t need to handle the information.

We have also talked about some other changes to things like relation-changed, where a charm could say “I support batching of events”. We would then no longer say that a single remote unit has updated its data, but be able to say “these units have changed”. We’d still queue up a hook whenever a unit enters/updates, but if multiple units do so before the hook is fired, you would get a single hook instead of one for each unit. We do believe we want to move this way; there is just concern that at present it would break compatibility for anything that does introspect JUJU_REMOTE_UNIT.
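
A toy model of that batching idea, purely to illustrate the intended behaviour: nothing here corresponds to an existing Juju API, and the class and method names are made up for the example.

```python
class RelationChangedQueue:
    """Collects unit change events and turns them into hook invocations."""

    def __init__(self, batching):
        self.batching = batching
        self.pending = []

    def unit_changed(self, unit):
        # A unit entered the relation or updated its data.
        self.pending.append(unit)

    def drain(self):
        """Return the hooks to run, each one as the list of units it covers."""
        units, self.pending = self.pending, []
        if not units:
            return []
        if self.batching:
            # One hook for everything queued so far; dict.fromkeys keeps
            # first-seen order and drops duplicate units.
            return [list(dict.fromkeys(units))]
        # One hook per event, each with a single remote unit.
        return [[unit] for unit in units]
```

Without batching, each drained hook covers a single unit (what JUJU_REMOTE_UNIT expresses today); with batching, a charm would receive a list instead, which is where the compatibility concern comes from.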

We’ve also talked about simplifying hooks and having a “context-get” hook tool, coupled with a “context-changed” hook. context-get would give you the superset of relation-get/config-get/network-get/etc., and when any of those change, a context-changed hook would be queued, with the explicit behavior that multiple things changing could still result in only a single hook.

aluria · 25 October 2019 07:44

Indeed, thank you very much for your reply @jameinel. It helped me clear up how I want things to work in both prometheus2-charm and nagios-charm, making use of JUJU_REMOTE_UNIT instead of parsing all the units related on a specific relation-id. It also helped me clarify concepts around leader_set.

Best,
-Alvaro.