Trying to wrap my head around k8s charms: `pod_spec_set` creates new agent and removes old one?


#1

Hi all

I’m trying to wrap my head around how units and relations work with k8s charms. I have a setup of two charms connected to each other: sse-endpoint-mock and sse-consumer.

It seems as if every time my charm executes pod_spec_set, a new juju unit agent gets created and the old one shuts down. Each agent seems to have its own reactive state, flags and hook lifecycle.

The initial relation and configuration works fine. Stuff gets weird however when I change a config option on the endpoint and it propagates from the endpoint to the consumer resulting in a redeploy of the pod in k8s.

This is what happens at the consumer side when it receives a new base-url config over its relation.

  1. Hook sse-endpoint-relation-changed runs. It sets a new pod spec with the updated base-url. I get a warning that the charm isn’t the leader, even though only one unit is present (see 15:18:52). Is this a bug?

  2. Suddenly, a new agent appears (con6/2 -> con6/3). Does updating a pod spec create a new Juju agent? This is really confusing, can you explain this more?

  3. The start hook gets executed as the new unit. This adds to the confusion because this can kickstart a new deployment etc… So it’s possible for a running pod to exist when an agent initially boots? Is the reactive state shared between agents? Is the relationship shared between agents? Does the charm at the other side now see two agents? How do you know whether you’re the agent creating the pod or the agent created by creating the pod?

  4. leader-settings-changed runs as the new unit. More confusion since it’s the same operator pod that becomes the new leader? Does this have anything to do with the issue in #1?

  5. config-changed runs. Oh well, I’m used to config-changed running at the strangest times. This isn’t an issue when you use the reactive config.changed flags since they only signal real config changes.

  6. sse-endpoint-relation-departed runs… Now this took me by complete surprise. Why is there a relationship departing? Apparently the old unit is shutting down now?

# 1. Hook `sse-endpoint-relation-changed` runs and sets a new pod spec.
#
application-con6: 15:18:50 INFO unit.con6/2.juju-log sse-endpoint:8: Reactive main running for hook sse-endpoint-relation-changed
application-con6: 15:18:51 DEBUG unit.con6/2.juju-log sse-endpoint:8: tracer>
application-con6: 15:18:51 DEBUG unit.con6/2.juju-log sse-endpoint:8: tracer: hooks phase, 0 handlers queued
application-con6: 15:18:51 DEBUG unit.con6/2.juju-log sse-endpoint:8: tracer>
tracer: ++   queue handler ../../application-con6/charm/hooks/relations/sse-endpoint/requires.py:20:base_url_changed:sse-endpoint
application-con6: 15:18:51 INFO unit.con6/2.juju-log sse-endpoint:8: Invoking reactive handler: ../../application-con6/charm/hooks/relations/sse-endpoint/requires.py:20:base_url_changed:sse-endpoint
application-con6: 15:18:51 DEBUG unit.con6/2.juju-log sse-endpoint:8: tracer: set flag sse-endpoint.changed
application-con6: 15:18:51 DEBUG unit.con6/2.juju-log sse-endpoint:8: tracer>
application-con6: 15:18:51 INFO unit.con6/2.juju-log sse-endpoint:8: Invoking reactive handler: reactive/sse-consumer.py:35:config_consumer
application-con6: 15:18:51 INFO unit.con6/2.juju-log sse-endpoint:8: status-set: maintenance: Configuring consumer container
application-con6: 15:18:52 INFO unit.con6/2.juju-log sse-endpoint:8: set pod spec:
application-con6: 15:18:52 WARNING juju.worker.uniter.context con6/2 is not the leader but is setting application pod spec
application-con6: 15:18:52 INFO unit.con6/2.juju-log sse-endpoint:8: status-set: maintenance: creating container
application-con6: 15:18:52 DEBUG unit.con6/2.juju-log sse-endpoint:8: tracer>
application-con6: 15:18:52 DEBUG unit.con6/2.juju-log sse-endpoint:8: tracer: cleared flag sse-endpoint.changed
application-con6: 15:18:52 INFO unit.con6/2.juju-log sse-endpoint:8: status-set: active: Pods created ()
application-con6: 15:18:52 DEBUG unit.con6/2.sse-endpoint-relation-changed lib/charms/layer/status.py:140: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
application-con6: 15:18:52 DEBUG unit.con6/2.sse-endpoint-relation-changed   includes = yaml.load(fp.read()).get('includes', [])
application-con6: 15:18:52 INFO juju.worker.uniter.operation ran "sse-endpoint-relation-changed" hook
application-con6: 15:18:53 INFO juju.worker.leadership con6 leadership for con6/3 denied
application-con6: 15:18:53 INFO juju.agent.tools ensure jujuc symlinks in /var/lib/juju/tools/unit-con6-3
application-con6: 15:18:53 INFO juju.worker.uniter unit "con6/3" started
application-con6: 15:18:53 INFO juju.worker.uniter hooks are retried true
application-con6: 15:18:53 INFO juju.worker.uniter found queued "start" hook

# 2. A new unit agent appears?

# 3. Hook `start` runs on new agent.
application-con6: 15:18:54 INFO unit.con6/3.juju-log Reactive main running for hook start
application-con6: 15:18:54 DEBUG unit.con6/3.juju-log tracer>
tracer: ++   queue handler ../../application-con6/charm/hooks/relations/sse-endpoint/requires.py:14:broken:sse-endpoint
tracer: ++   queue handler ../../application-con6/charm/hooks/relations/sse-endpoint/requires.py:20:base_url_changed:sse-endpoint
application-con6: 15:18:54 DEBUG unit.con6/3.juju-log tracer>
application-con6: 15:18:54 DEBUG unit.con6/3.juju-log tracer: hooks phase, 0 handlers queued
application-con6: 15:18:54 DEBUG unit.con6/3.juju-log tracer>
tracer: ++   queue handler ../../application-con6/charm/hooks/relations/sse-endpoint/requires.py:14:broken:sse-endpoint
tracer: ++   queue handler ../../application-con6/charm/hooks/relations/sse-endpoint/requires.py:20:base_url_changed:sse-endpoint
application-con6: 15:18:54 INFO unit.con6/3.juju-log Invoking reactive handler: reactive/sse-consumer.py:24:notify_relation_needed
application-con6: 15:18:54 INFO unit.con6/3.juju-log Invoking reactive handler: ../../application-con6/charm/hooks/relations/sse-endpoint/requires.py:14:broken:sse-endpoint
application-con6: 15:18:54 DEBUG unit.con6/3.juju-log tracer>
tracer: -- dequeue handler ../../application-con6/charm/hooks/relations/sse-endpoint/requires.py:14:broken:sse-endpoint
application-con6: 15:18:54 INFO unit.con6/3.juju-log Invoking reactive handler: ../../application-con6/charm/hooks/relations/sse-endpoint/requires.py:20:base_url_changed:sse-endpoint
application-con6: 15:18:55 INFO unit.con6/3.juju-log status-set: blocked: Please add relationship to sse endpoint.
application-con6: 15:18:55 DEBUG unit.con6/3.start lib/charms/layer/status.py:140: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
application-con6: 15:18:55 DEBUG unit.con6/3.start   includes = yaml.load(fp.read()).get('includes', [])
application-con6: 15:18:55 INFO juju.worker.uniter.operation ran "start" hook

# 4. leader-settings-changed on new agent
#
application-con6: 15:18:55 INFO unit.con6/3.juju-log Reactive main running for hook leader-settings-changed
application-con6: 15:18:55 DEBUG unit.con6/3.juju-log tracer>
application-con6: 15:18:56 DEBUG unit.con6/3.juju-log tracer: hooks phase, 0 handlers queued
application-con6: 15:18:56 DEBUG unit.con6/3.juju-log tracer>
tracer: ++   queue handler ../../application-con6/charm/hooks/relations/sse-endpoint/requires.py:20:base_url_changed:sse-endpoint
application-con6: 15:18:56 INFO unit.con6/3.juju-log Invoking reactive handler: reactive/sse-consumer.py:24:notify_relation_needed
application-con6: 15:18:56 INFO unit.con6/3.juju-log Invoking reactive handler: ../../application-con6/charm/hooks/relations/sse-endpoint/requires.py:20:base_url_changed:sse-endpoint
application-con6: 15:18:56 INFO unit.con6/3.juju-log status-set: blocked: Please add relationship to sse endpoint.
application-con6: 15:18:56 DEBUG unit.con6/3.leader-settings-changed lib/charms/layer/status.py:140: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
application-con6: 15:18:56 DEBUG unit.con6/3.leader-settings-changed   includes = yaml.load(fp.read()).get('includes', [])
application-con6: 15:18:56 INFO juju.worker.uniter.operation ran "leader-settings-changed" hook

# 5. config-changed on new agent
#
application-con6: 15:18:57 INFO unit.con6/3.juju-log Reactive main running for hook config-changed
application-con6: 15:18:57 DEBUG unit.con6/3.juju-log tracer>
application-con6: 15:18:57 DEBUG unit.con6/3.juju-log tracer: hooks phase, 0 handlers queued
application-con6: 15:18:57 DEBUG unit.con6/3.juju-log tracer>
tracer: ++   queue handler ../../application-con6/charm/hooks/relations/sse-endpoint/requires.py:20:base_url_changed:sse-endpoint
application-con6: 15:18:57 INFO unit.con6/3.juju-log Invoking reactive handler: reactive/sse-consumer.py:24:notify_relation_needed
application-con6: 15:18:57 INFO unit.con6/3.juju-log Invoking reactive handler: ../../application-con6/charm/hooks/relations/sse-endpoint/requires.py:20:base_url_changed:sse-endpoint
application-con6: 15:18:57 INFO unit.con6/3.juju-log status-set: blocked: Please add relationship to sse endpoint.
application-con6: 15:18:58 DEBUG unit.con6/3.config-changed lib/charms/layer/status.py:140: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
application-con6: 15:18:58 DEBUG unit.con6/3.config-changed   includes = yaml.load(fp.read()).get('includes', [])
application-con6: 15:18:58 INFO juju.worker.uniter.operation ran "config-changed" hook

# 6. Hook `sse-endpoint-relation-departed` runs on OLD agent
#
application-con6: 15:18:58 INFO unit.con6/2.juju-log sse-endpoint:8: Reactive main running for hook sse-endpoint-relation-departed
application-con6: 15:18:58 DEBUG unit.con6/2.juju-log sse-endpoint:8: tracer>
tracer: ++   queue handler ../../application-con6/charm/hooks/relations/sse-endpoint/requires.py:20:base_url_changed:sse-endpoint
application-con6: 15:18:59 DEBUG unit.con6/2.juju-log sse-endpoint:8: tracer>
application-con6: 15:18:59 DEBUG unit.con6/2.juju-log sse-endpoint:8: tracer: hooks phase, 0 handlers queued
application-con6: 15:18:59 DEBUG unit.con6/2.juju-log sse-endpoint:8: tracer>
tracer: ++   queue handler ../../application-con6/charm/hooks/relations/sse-endpoint/requires.py:20:base_url_changed:sse-endpoint
application-con6: 15:18:59 INFO unit.con6/2.juju-log sse-endpoint:8: Invoking reactive handler: reactive/sse-consumer.py:24:notify_relation_needed
application-con6: 15:18:59 INFO unit.con6/2.juju-log sse-endpoint:8: Invoking reactive handler: ../../application-con6/charm/hooks/relations/sse-endpoint/requires.py:20:base_url_changed:sse-endpoint
application-con6: 15:18:59 INFO unit.con6/2.juju-log sse-endpoint:8: status-set: blocked: Please add relationship to sse endpoint.
application-con6: 15:18:59 DEBUG unit.con6/2.sse-endpoint-relation-departed lib/charms/layer/status.py:140: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
application-con6: 15:18:59 DEBUG unit.con6/2.sse-endpoint-relation-departed   includes = yaml.load(fp.read()).get('includes', [])
application-con6: 15:18:59 INFO juju.worker.uniter.operation ran "sse-endpoint-relation-departed" hook

I tried to debug this further but since juju debug-log filtering doesn’t seem to be working, it’s really hard as I have to grep my way through the logs, but many log entries are multiline… :confused:

So the tl;dr of all my questions is:

Can you please explain how creating a pod spec influences the Juju agents, the reactive framework state, the operator pods, relations and hooks?


#2

AIUI, the way you tell a pod that its configuration has changed is that it gets bounced, and the new one started with the new configuration.
Juju tracks each pod's identity and maps it to an app/X, but when a pod is stopped and a new one created, we roll to app/Y.
Note that the Juju agent is running in the operator pod, not the application pod. So it isn’t a new agent, but it is a fresh exec of the application.


#3

Thanks

I understand, but as you can see in the logs, multiple juju unit agents seem to be running: con6/2 and con6/3. I assume they run in the same operator pod.

The log starts out with agent con6/2 running. After this agent runs pod_spec_set, a new Juju agent gets created: con6/3. The agent who just ran pod_spec_set (con6/2) then shuts down, starting with the relation-departed hook.

This seems incredibly strange behavior; why does Juju create a new unit agent and shut down the previous agent every time the pod spec changes? I would expect pod_spec_set to only affect the service/application pods; not the juju unit agents that run in the operator pod.


#4

Juju does not create a new unit agent. For k8s charms, there is one operator pod per application. That pod runs a unit agent for each unit as a goroutine, and each unit agent maintains its own charm state etc., as would be expected. When a new pod spec is sent to Juju by the unit leader, the pod spec is applied to the k8s Deployment Controller (for stateless applications) or StatefulSet (for applications with storage). k8s itself then terminates running pods and starts new ones with the new pod spec.

Yes, there’s currently an issue we’re trying to pin down. It should be an error but is currently only a warning until we can fix it.

It’s a new pod created by k8s which maps to a new unit number in Juju. Stateless applications are managed in k8s by Deployment Controllers. Each k8s pod has a UID, and whenever a pod is killed and restarted, a new pod with a new UID is the result. Stateless k8s pods do not maintain identity when restarted/replaced. A k8s unit in Juju is tied to a specific pod and when that pod terminates and a new one starts, Juju has to create a new unit with a new unit number to represent it. This mirrors what k8s does - restarting a pod results in a brand new pod name showing in kubectl get pods.

Each goroutine running in the (single) application operator representing a given unit maintains its own charm state, as is the case for any unit agent. It’s up to the unit leader to coordinate between multiple units if there’s more than one. There’s no difference in the underlying model between k8s applications and vm applications. Conceptually there’s one application, one or more units, each unit having its own agent and maintaining its own charm state. When a unit is killed and a new one started, a new unit agent runs and the charm state for that new unit starts from scratch.

For stateless applications which are managed by a k8s Deployment Controller, updating a pod spec causes the replacement pod to get a new identity and Juju sees this as the old unit being removed and a new one taking its place.

For applications with storage which are managed by a k8s StatefulSet, updating a pod spec results in a new pod coming up but the StatefulSet ensures the k8s pods retain their identity and so Juju can map those to the existing units and not create new unit numbers (again, mirroring the k8s behaviour). k8s reattaches any mounted volumes to the new pods to preserve workload data. In the same way, because the units retain their identity, any existing charm state is also preserved.
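To make the difference between the two cases concrete, here is a toy model of the unit numbering described above. It is only an illustration and not Juju code; the class and method names are made up for this sketch.

```python
class ToyApplication:
    """Toy model of how k8s pods map to Juju unit numbers (not Juju code)."""

    def __init__(self, name, stateful):
        self.name = name
        self.stateful = stateful
        self.units = {}   # unit name -> per-unit charm state
        self._next = 0    # unit numbers are never reused

    def add_unit(self):
        unit = f"{self.name}/{self._next}"
        self._next += 1
        self.units[unit] = {}  # a new unit starts with empty charm state
        return unit

    def apply_pod_spec(self):
        """Simulate k8s replacing every pod after a pod spec change."""
        if self.stateful:
            # StatefulSet: pods keep their identity, so units and state survive.
            return sorted(self.units)
        # Deployment: every replacement pod becomes a brand new unit.
        replaced = len(self.units)
        self.units = {}
        return [self.add_unit() for _ in range(replaced)]
```

In this model a stateless `con6/0` that has set some state comes back as `con6/1` with empty state after `apply_pod_spec()`, while a stateful unit keeps both its name and its state.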

Possibly. The aim is to look in to this next week prior to 2.6 rc1.

Correct. The old (stateless) unit is shut down and replaced. This mirrors the k8s behaviour of removing the old pod and starting a brand new one whenever the pod spec changes.

This is not that different from adding a new unit. Charms need to know how to initialise new workload instances.

Hopefully the above explanation helps. Please ask again if anything is unclear.


#5

Thanks a lot for the thorough explanation! This explains what I’m seeing.

I understand how it works now. However, I wonder why it behaves this way with stateless applications. Specifically, why is this behavior wanted or required from the charm author’s perspective?

This is an important question imo, because this behavior has a lot of downsides. It adds a lot of complexity and will be incompatible with a lot of current relations and interfaces, depending on how they are implemented.

  • depending on whether a k8s charm uses storage or not, the agent, the reactive framework and the hook lifecycle will be different. This will take many more people by surprise.
  • the new agent will have to coordinate with the dying agent so it knows what podspec has been set.
  • relations will behave very differently for stateless apps. Interface layers that keep track of what has been requested/what state a relationship is in become completely useless after a pod spec set.
  • What happens when a pod crashes? Do charmers have to write their agents with the thought that an agent might shut down at any moment, without any admin intervention?

It’s also weird conceptually. This is contrary to the “cattle vs pets” ideology of stateless Kubernetes applications. Stateless applications are cattle, it shouldn’t matter if a stateless pod gets replaced by a new one because it’s cattle. To take this metaphor much further than anyone should; A new pod shouldn’t be seen as “the previous pet died and now we have a new one” but as “we still have one cow. What’s their name? idc, as long as it gives me milk and/or meat”.

Also, a big advantage to Juju on k8s for me is that we can manage stateless applications using a stateful, event-driven agent. However, the only way you can keep the state after you change the pod spec is to do some magic with peer relationships. Is that even possible when one agent is spinning up and the other one is dying?

The next two examples show some of the additional complexity and unexpected behavior that this creates.

Example: simple stateless app

  1. agent 1 starts up
  2. agent 1 generates podspec and sets it
  3. controller sees that podspec has changed; it shuts down agent 1 and creates agent 2.
  4. agent 1 shuts down
  5. agent 2 starts up
  6. agent 2 generates podspec and sets it
  7. controller sees that podspec hasn’t changed, nothing happens (I’m assuming that an agent only gets replaced when the podspec actually changes.)

This flow is wasteful, but it still functions somewhat.

But what if the podspec is not identical between different runs? What if the podspec contains the agent number; what if the podspec contains a randomly generated private key, diffie-hellman parameters or unique identifier? They would all result in an endless loop of agents being created, setting a new podspec and being destroyed.

It might be possible to work around this using peer relationships, but this adds a lot of complexity.
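A minimal sketch of the problem, assuming the controller compares pod specs by content. The dict below is an illustrative fragment rather than a complete pod spec, and `shared_store` is a hypothetical stand-in for any data that outlives a single unit (for example peer relation data).

```python
import hashlib
import json
import secrets


def fingerprint(pod_spec):
    """Content hash of a pod spec dict, used here only to detect changes."""
    encoded = json.dumps(pod_spec, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def unstable_spec():
    # A fresh secret on every run, so two runs never produce the same spec.
    token = secrets.token_hex(16)
    return {"containers": [{"name": "app", "config": {"token": token}}]}


def stable_spec(shared_store):
    # The secret is generated once and then reused by every later unit.
    if "token" not in shared_store:
        shared_store["token"] = secrets.token_hex(16)
    return {"containers": [{"name": "app",
                            "config": {"token": shared_store["token"]}}]}
```

Two calls to `unstable_spec()` give different fingerprints, so each new agent would trigger another replacement. Two calls to `stable_spec(store)` with the same store give the same fingerprint, at the cost of having to maintain that store somewhere.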

Example: stateless client connecting to a database

The Postgres charm has the ability to create users and databases when a client requests them over the relationship. The following scenario has a stateless client charm connected to the current postgres charm.

Note that stateless pods are also used for stateful applications; they just store their state remotely; for example in an external postgres DB.

  1. client/0 agent starts and gets the relation-joined etc hooks.
  2. client/0 requests a database and username.
  3. postgres/0 creates db and user and sends connection info back.
  4. client/0 agent receives connection info, puts connection info in podspec and sets podspec.
  5. client/0 shuts down, triggering relation-departed on client and db
  6. client/1 starts and gets the relation-joined etc hooks.
  7. client/1 sees that it already has an offer so it doesn’t request a new database
  8. postgres/0 gets a relation-departed, and thus can’t see the request that client/0 sent, but client/1 doesn’t send a new request because client/0 already sent it.

What will postgres/0 do at this point? Will it remove the database or revoke the credentials? Depending on the implementation of the postgres charm, it will be able to handle this case (by accident) or not.

Very few authors have thought of the case where a charm gets a request from one agent, that agent then shuts down and a new agent appears that uses the response.

The client could re-request the credentials in step 7. but how will client/1 know what database and user client/0 requested? How will postgres/0 realize that even though it got two requests, only one DB needs to be made? This could probably be solved using peer relationships, but this creates additional complexity. Moreover, a bunch of interface layers already use uids to keep track of connected agents and requests; so connecting a stateless k8s charm will require rewriting those interface layers.
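One possible direction, sketched as a toy provider that keys requests by application name instead of unit name. This is purely hypothetical and says nothing about how the real postgres charm or its interface layer behaves; `ToyDatabaseProvider` and its method are invented for the illustration.

```python
class ToyDatabaseProvider:
    """Toy provider that tracks database requests per application."""

    def __init__(self):
        self.databases = {}  # application name -> database name

    def handle_request(self, unit_name, database):
        # "client/0" and "client/1" both belong to application "client".
        app = unit_name.split("/", 1)[0]
        # The first request from any unit of the application creates the
        # database; later requests from replacement units reuse it.
        return self.databases.setdefault(app, database)
```

With this keying, a request from client/0 followed by the same request from client/1 results in a single database, but every interface layer that currently tracks units would still need to be changed to work this way.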


#6

Quick initial reply as it’s late here…

A key issue I think you are facing is that Juju only provides a mechanism to store reactive state per unit. The issue you mention is not that different to a user adding a new unit - that still comes up and its reactive state is empty and it needs to figure out what’s been set up previously. Or a leader unit being killed because the machine it was on went down and a new unit becomes leader etc.

Do you know about leader-set and leader-get? When the unit leader sets key value pairs using leader-set, these are stored on the controller and a leader-settings-changed hook is fired on the other units. All units can use leader-get to read this application level data bag, and being stored on the controller, it is obviously available to any newly started pod.
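A rough sketch of how that could look from charm code. The three callables are passed in so the example runs outside a Juju environment; in a reactive charm they would typically be `is_leader`, `leader_set` and `leader_get` from `charmhelpers.core.hookenv`. The key name `pod-spec-inputs` and both function names are assumptions made for this example.

```python
import json


def publish_pod_spec_inputs(is_leader, leader_set, inputs):
    """Leader only: store the values the pod spec was built from."""
    if not is_leader():
        return False
    # Leader settings are string key/value pairs, so serialise to JSON.
    leader_set({"pod-spec-inputs": json.dumps(inputs, sort_keys=True)})
    return True


def recover_pod_spec_inputs(leader_get):
    """Any unit, including a freshly started one: read the stored values."""
    raw = leader_get("pod-spec-inputs")
    return json.loads(raw) if raw else None
```

A replacement unit that starts with empty reactive state could call `recover_pod_spec_inputs` in its start hook to find out what the previous unit had already set up.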

As a side note, next cycle we will be looking at application level relation data. Many times, the relation data is not unit specific, but common to the application. Implementing this will eliminate a lot of cross talk between units either side of a relation, and simplify the data model etc.


#7

I think this approach faces three issues:

  1. As you explained; the new unit agent comes up with an empty reactive state, so it has no idea what’s going on. (the “simple stateless app” example)
  2. Current interfaces don’t take into account that units at both sides of a relationship can be replaced by “themselves”. (the “stateless client connecting to a database” example)
  3. Old unit agents don’t know they’re replaced when they get shut down, so they get very confused. (“the example below”)

Solving these issues might be possible using leader-set and leader-get but this will add a lot of complexity.

Example: Status updates when new podspec is set

 from charmhelpers.core.hookenv import status_set
 from charms.reactive import when, when_not


 @when_not('endpoint.sse-endpoint.joined')
 def notify_relation_needed():
     status_set('blocked', 'Please add relation to sse endpoint.')


 @when('endpoint.sse-endpoint.joined')
 @when_not('sse-endpoint.available')
 def notify_waiting():
     status_set('waiting', 'Waiting for sse endpoint to send connection information.')

This code gets triggered both at startup of an agent and at shutdown of an agent. This normally isn’t an issue, except in the case of a stateless k8s charm, because this will also get triggered when the agent shuts down after the pod spec changed. The result of this is that any time the pod spec changes, I get a blocked or waiting message from the previous agent shutting down.


#8

Ian and I talked about this a bit today, and I do think it means that there are some fundamental differences that we need charm authors to be aware of, which will probably fold into a more general discussion of application vs unit configuration.

Relation data in the abstract sense is just more configuration for a given application. There is the user supplied direct configuration (juju config) and there is configuration that comes from being related to other applications. In Kubernetes, the only way to update the state (config) of the running application pod, is to generate a new podspec and have that pod get restarted. We did try for a while to track what pod maps to what Juju unit, but ultimately re-using juju units with K8s pods runs into a lot of race conditions. (a pod is stopping, and another starting, is that logically the same pod, or is it another unit of the pod. If you scale to 5, and a unit dies in the meantime, what does the mapping mean, etc.)

Given that units are ephemeral, there really is little configuration that should be done at a per-unit level, but instead we should be looking more at whole-app level configuration. And currently, the way charms do app-level configuration is in leader hooks and leader-set/get. In 2.7 we will be looking to implement support for relation-data that is app-to-app instead of unit-to-app. (right now each side gets to set its bucket of unit relation data, but that is seen by all units of the other side, thus unit-to-app.)

If you don’t do this with K8s, there is an interesting way that things can explode. For example, say 2 applications each want to know the exact IP of the remote unit that will be talking to them. A/0 at IP 10.0.0.1 will see that there is a B/0 at IP 10.0.1.1. However, when it gets that information in relation data, it must restart its pod in order to pass the IP of B/0 to the workload. On doing so, A/0 now gets a new pod at a new IP (and a new juju unit, but even if it didn’t the rest holds true.) Now B sees that A/1 is at 10.0.0.2, so it must restart its pod to record that IP. Which then starts the cycle anew.
Thus with K8s applications, you really must use Load Balancing, which is the only way to get a stable IP for an application made up of units that will be restarted at any point, if dependent applications need to be told the IP of the other application and don’t want to be restarted every time.

I do believe in native K8s they handle this by LBaaS and DNS records. (pods aren’t told to talk to other pods, but to a DNS record that is continually getting updated, etc.)
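The restart cycle can be shown with a small simulation. It assumes that a side restarts its pod whenever the address it sees for the other side changes, and that a restart may change the address that side advertises. The function name and the address schemes are made up for the illustration.

```python
def restarts_until_stable(addr_a, addr_b, max_rounds=10):
    """Count pod restarts until neither side sees a changed peer address.

    addr_a and addr_b take a restart count and return the address that
    side advertises after that many restarts. Returns None if the two
    sides are still restarting each other after max_rounds.
    """
    restarts_a = restarts_b = 0
    seen_by_a = seen_by_b = None
    for _ in range(max_rounds):
        changed = False
        if addr_b(restarts_b) != seen_by_a:
            seen_by_a = addr_b(restarts_b)
            restarts_a += 1  # A restarts its pod to pick up B's address
            changed = True
        if addr_a(restarts_a) != seen_by_b:
            seen_by_b = addr_a(restarts_a)
            restarts_b += 1  # B restarts its pod to pick up A's address
            changed = True
        if not changed:
            return restarts_a + restarts_b
    return None


# Pod IPs change on every restart, so the two sides never settle.
pod_ips = restarts_until_stable(lambda n: f"10.0.0.{n + 1}",
                                lambda n: f"10.0.1.{n + 1}")
# A service name stays the same no matter how often the pods restart.
services = restarts_until_stable(lambda n: "a-svc", lambda n: "b-svc")
```

Here `pod_ips` is `None` because the loop never converges, while `services` is `2`: one initial restart per side and then nothing more.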

We do have a few issues with leadership that will need to be sorted (eg currently a unit being removed doesn’t mean it loses its leadership token, which means it can be up to a minute until the next leader can be elected, which is especially noticeable if you have a single unit deployment on K8s and the unit gets restarted). And we’ll be introducing application scope relation data bags so that leaders can set relation-data that automatically is seen by the remote units (rather than leader-set, triggering units to read that information and relay it into their per-unit data bag.)

I do think that if you start with writing charms that think in terms of applications as the primary logical object you’ll be most of the way to where we want to be, and we can get you quality-of-life improvements to make it easier to do so.


#9

edit: rephrasing

If you need this functionality in k8s, you need to use StatefulSets. From the k8s docs:

StatefulSets are valuable for applications that require one or more of the following.

  • Stable, unique network identifiers.
  • Stable, persistent storage.
  • Ordered, graceful deployment and scaling.
  • Ordered, automated rolling updates.

If a charm requires any of these things, it should use a StatefulSet. For example, if it requires action when a pod crashes, if it can’t handle 10 pods running when it expects 5, or if it matters to your charm that you have two stateless pods running for one Juju agent, then you should use a StatefulSet.

So, given the limitations of k8s stateless pods, we’re not allowed to use them if it matters how many pods are actually running. Then why does it matter that a Juju agent might have two pods associated to it? It also shouldn’t matter if a stateless pod gets restarted. Then why does it matter that a Juju agent gets restarted when a pod restarts?

You don’t want a charmer to try to work around these limitations with charm code because the result will be much worse than what k8s can do. K8s reroutes traffic quicker than the time it takes for charms.reactive to initialise so you’ll have a bad time if your app is down until Juju performs an action. Charm code simply isn’t made for split-second failure recovery. An analogy in normal-cloud world: A charm configures ceph so that it’s HA; the charm doesn’t handle the actual failover itself.

Yes. It’s not Juju’s job to replace this. Don’t talk to pods directly; talk to services so you don’t care what’s behind it. Juju has a lot to offer on top of the k8s service discovery mechanisms, but it should not try to replace them. For example: juju can pass around service names and fqdn’s but it shouldn’t replace the concept of a k8s service.

The thing is that this isn’t about k8s charmers having to think in terms of applications. How will these charms integrate with the wider ecosystem? Connecting k8s applications to the wider Juju ecosystem is a big selling point, but you lose that when half of the interfaces need to be modified to work with the atypical behaviour of k8s agents. Simply figuring out how an interface will react to this requires looking at the interface code and how it’s used.


#10

We do need to move to an application centric view, and this applies equally to non-k8s charms as well. There’s work scheduled for next cycle to start going in that direction, with the introduction of application relation data.
What we can do though to ease the transition for k8s charms, as well as provide more flexibility for charms to declare how they want to be deployed, is the following.

Introduce new charm metadata (substrate dependent):

deployment:
  type: stateless | stateful
  service: cluster | loadbalancer | external

The default deployment type would be stateless. Charms which declare storage would need to also say they want a stateful deployment or else we’ll error early. A charm is free to say it wants a stateful deployment even without storage being required.

The default service would be cluster. Note that regardless of what the charm asks for, the user can still currently override this at deployment time using the kubernetes-service-type config item.

The metadata deployment entity model is generic, but the allowable values are substrate dependent. The only substrate we currently support is k8s.
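The defaulting and early-error rules described above could be checked roughly like this. This is only a sketch of the proposal as worded in this post, not actual Juju validation code; `resolve_deployment` is a made-up name and the metadata is assumed to be already parsed into a dict.

```python
DEPLOYMENT_TYPES = ("stateless", "stateful")
SERVICE_TYPES = ("cluster", "loadbalancer", "external")


def resolve_deployment(metadata):
    """Return (type, service) for parsed charm metadata, or raise ValueError."""
    deployment = metadata.get("deployment") or {}
    dep_type = deployment.get("type", "stateless")   # proposed default
    service = deployment.get("service", "cluster")   # proposed default
    if dep_type not in DEPLOYMENT_TYPES:
        raise ValueError(f"unknown deployment type: {dep_type!r}")
    if service not in SERVICE_TYPES:
        raise ValueError(f"unknown service type: {service!r}")
    # Charms that declare storage must explicitly ask for a stateful deployment.
    if metadata.get("storage") and dep_type != "stateful":
        raise ValueError("charm declares storage but deployment type is not stateful")
    return dep_type, service
```

A charm with no `deployment` section and no storage resolves to `("stateless", "cluster")`, and a charm may ask for `stateful` without declaring any storage.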